grep -o . (some File) | sort -f | uniq -ic
I want to know if there is a way to make the output of this look better.
What this command displays:
1 a
1 b
2 c
What I want it to look like:
1 = a
1 = b
Use awk like this:
grep -o . (some File) | sort -f | uniq -ic | awk '{$1=$1" ="; print $0}'
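If you want the output formatted exactly as 1 = a without the leading spaces that uniq -ic adds, a printf-based variant does the same job (just a sketch, using the same placeholder for your file):
grep -o . (some File) | sort -f | uniq -ic | awk '{printf "%d = %s\n", $1, $2}'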
How can I compute the following from within the Unix terminal and then store the results in a file? The input looks like this:
4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b91302add63f7b115
4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b91302add63f7b115
4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss
49FB-A855-3EED46E0BF2E,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss
The desired output is:
4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b91302add63f7b115, 2
4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss, 1
49FB-A855-3EED46E0BF2E,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss, 1
EDIT:
OK, I think I got it:
cat lol | cut -f 1,2 -d ',' | sort | uniq -c > lol2
My only problem now is that the first column of the output file should in fact be at the end, and also that the output file should be CSV compatible. Any ideas?
Would it be a problem to simply count unique lines instead? If not, the uniq command is your friend; see its man page, but be sure to sort the list first so that all repetitions appear one after another:
sort myfile.txt | uniq -c
For your example data, this returns:
2 4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b91302add63f7b115
1 4F8D-AA87-D9EC8805DFDA,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss
1 49FB-A855-3EED46E0BF2E,3a58538d510c66b98ad7bb3cb9768de08e1ae30b9130dsasdadsadss
To redirect into a file, append > outfile.txt:
sort myfile.txt | uniq -c > outfile.txt
If you need an output similar to the one in your question, you can use awk to reorder columns and sed to change delimiters:
sort myfile.txt | uniq -c | awk '{ print $2 " " $1 }' | sed 's/ /,/'
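If you would rather skip the sed step, awk alone can insert the comma (and the space after it, matching the output shown in the question); a minimal sketch along the same lines:
sort myfile.txt | uniq -c | awk '{ print $2 ", " $1 }'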
Given a .txt file with space-separated words such as:
But where is Esope the holly Bastard
But where is
And the awk command:
cat /pathway/to/your/file.txt | tr ' ' '\n' | sort | uniq -c | awk '{print $2"#"$1}'
I get the following output in my console:
1 Bastard
1 Esope
1 holly
1 the
2 But
2 is
2 where
How do I get this printed into myFile.txt?
I actually have 300,000 lines and nearly 2 million words, so it is better to output the result to a file.
EDIT: the answer I used (by @Sudo_O):
$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort > myfileout.txt
Your pipeline isn't very efficient; you should do the whole thing in awk instead:
awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file > myfile
If you want the output in sorted order:
awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file | sort > myfile
The actual output given by your pipeline is:
$ tr ' ' '\n' < file | sort | uniq -c | awk '{print $2"#"$1}'
Bastard#1
But#2
Esope#1
holly#1
is#2
the#1
where#2
Note: using cat is useless here; we can just redirect the input with <. The awk script doesn't make sense either: it just reverses the order of the words and their frequencies and separates them with a #. If we drop the awk script, the output is closer to the desired output (note the leading spaces, however, and that it's unsorted):
$ tr ' ' '\n' < file | sort | uniq -c
1 Bastard
2 But
1 Esope
1 holly
2 is
1 the
2 where
We could sort again and remove the leading spaces with sed:
$ tr ' ' '\n' < file | sort | uniq -c | sort | sed 's/^\s*//'
1 Bastard
1 Esope
1 holly
1 the
2 But
2 is
2 where
But as I mentioned at the start, let awk handle it:
$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file | sort
1 Bastard
1 Esope
1 holly
1 the
2 But
2 is
2 where
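If you happen to have GNU awk (gawk), you can even drop the external sort and order the results inside awk itself; this sketch relies on gawk's PROCINFO["sorted_in"] extension, so it is not portable awk:
gawk '{a[$1]++} END{PROCINFO["sorted_in"]="@val_num_asc"; for(k in a) print a[k],k}' RS=" |\n" file > myfile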
Just redirect the output to a file:
cat /pathway/to/your/file.txt | tr ' ' '\n' | sort | uniq -c | \
awk '{print $2"#"$1}' > myFile.txt
Just use shell redirection:
echo "test" > overwrite-file.txt
echo "test" >> append-to-file.txt
Tips
A useful command is tee, which allows you to redirect to a file and still see the output:
echo "test" | tee overwrite-file.txt
echo "test" | tee -a append-file.txt
Sorting and locale
I see you are working with an Asian script; you need to be careful with the locale used by your system, as the resulting sort might not be what you expect:
* WARNING * The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
And have a look at the output of:
locale
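For example, to force the traditional byte-value collation just for the sort step of the pipeline above (a sketch; you could instead export LC_ALL=C for the whole session):
tr ' ' '\n' < /pathway/to/your/file.txt | LC_ALL=C sort | uniq -c > myFile.txt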
I'd like to count the number of XML nodes in my XML file (with grep or somehow).
....
<countryCode>GBR</countryCode>
<countryCode>USA</countryCode>
<countryCode>CAN</countryCode>
...
<countryCode>CAN</countryCode>
<someNode>USA</someNode>
<countryCode>CAN</countryCode>
<someNode>Otherone</someNode>
<countryCode>GBR</countryCode>
...
How can I get a count of the individual countries, like CAN = 3, USA = 1, GBR = 2, without passing in the names of the countries? (There might be some more countries.)
Update:
There are other nodes besides countryCode.
My simple suggestion would be to use sort and uniq -c:
$ echo '<countryCode>GBR</countryCode>
<countryCode>USA</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>GBR</countryCode>' | sort | uniq -c
3 <countryCode>CAN</countryCode>
2 <countryCode>GBR</countryCode>
1 <countryCode>USA</countryCode>
Where you'd pipe in the output of your grep instead of an echo. A more robust solution would be to use XPath. If your XML file looks like:
<countries>
<countryCode>GBR</countryCode>
<countryCode>USA</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>GBR</countryCode>
</countries>
Then you could use:
$ xpath -q -e '/countries/countryCode/text()' countries.xml | sort | uniq -c
3 CAN
2 GBR
1 USA
I say it's more robust because using tools designed for parsing flat text will be inherently flaky when dealing with XML. Depending on the context of the original XML file, a different XPath query, which would match the elements anywhere, might work better:
$ xpath -q -e '//countryCode/text()' countries.xml | sort | uniq -c
3 CAN
2 GBR
1 USA
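If the xpath utility isn't installed, xmlstarlet (assuming you have it) can do the same job; a rough equivalent for the same countries.xml:
xmlstarlet sel -t -m '//countryCode' -v '.' -n countries.xml | sort | uniq -c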
grep can give a total count, but it doesn't give a per-pattern count; for that you should use uniq -c:
$ uniq -c <(sort file)
1
1
3 <countryCode>CAN</countryCode>
2 <countryCode>GBR</countryCode>
1 <countryCode>USA</countryCode>
If you want to get rid of the empty lines and tags, add sed:
$ sed -e '/^[[:space:]]*$/d' -e 's/<.*>\([A-Z]*\)<.*>/\1/g' test | sort | uniq -c
3 CAN
2 GBR
1 USA
To delete lines that don't have a country code, add another command to sed:
$ sed -e '/countryCode/!d' -e '/^[[:space:]]*$/d' -e 's/<.*>\([A-Z]*\)<.*>/\1/g' test | sort | uniq -c
3 CAN
2 GBR
1 USA
Quick and dirty (based only on your example text):
awk -F'>|<' '{a[$3]++;}END{for(x in a)print x,a[x]}' file
test:
kent$ cat t.txt
<countryCode>GBR</countryCode>
<countryCode>USA</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>GBR</countryCode>
kent$ awk -F'>|<' '{a[$3]++;}END{for(x in a)print x,a[x]}' t.txt
USA 1
GBR 2
CAN 3
sed -n "s/<countryCode>\(.*\)<\/countryCode>/\1/p"|sort|uniq -c
cat dummy | sort | cut -c14-16 | sort | tail -6 | awk '{col[$1]++} END {for (i in col) print i, col[i]}'
dummy is your file name; replace the 6 in -6 with n-2 (where n is the number of lines in your data file).
Something like this maybe:
grep -e 'regex' file.xml | sort | uniq -c
Of course you need to provide regex that matches your needs.
If your file is set up as you have shown us, awk can do it like this:
awk -F '</?countryCode>' '{ a[$2]++ } END { for (e in a) { printf("%s\t%i\n", e, a[e]) } }' INPUTFILE
If there is more than one <countryCode> tag on a line, you can still set up a pipe to put each tag on its own line, e.g.:
sed 's/<countryCode>/\n<countryCode>/g' INPUTFILE | awk ...
Note that if a <countryCode> element spans multiple lines, this does not work as expected.
Anyway, I'd recommend using XPath for this kind of task (Perl's XML::XPath module has a CLI utility for this).
Quick and simple:
grep countryCode ./file.xml | sort | uniq -c
I have to compare the md5sums of 80 copies of the same file with each other and report a failure on a mismatch. How do I do this effectively in bash? I am looking for an elegant algorithm to do it.
md5sum FILES | sed 's/ .*$//' | sort -u
If you get more than one line of output, you have a mismatch.
(This doesn't tell you where the mismatch is.)
Putting it together, and replacing the sed command with a somewhat less terse awk command:
count=$(md5sum "$@" | awk '{print $1}' | sort -u | wc -l)
if [ $count -eq 1 ] ; then
echo "Everything matches"
else
echo "Nope"
fi
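If you also need to know which files are the odd ones out, a small variation (just a sketch, treating the first file's checksum as the reference copy) reports each mismatching file by name:
# use the first file's checksum as the reference
ref=$(md5sum "$1" | awk '{print $1}')
for f in "$@"; do
    # report any file whose checksum differs from the reference
    [ "$(md5sum "$f" | awk '{print $1}')" = "$ref" ] || echo "MISMATCH: $f"
done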
The output of:
md5sum $files | sort -k 1,2
is a list of the checksums in sorted order, with the corresponding file names afterwards. If you need to eyeball the results, this might be sufficient. If you need to identify odd-ball results, you have to decide on the presentation. You say you've got 80 copies of 'the same file'. Suppose there are actually 10 copies of each of 8 versions of 'the file'. How are you going to decide which is correct and which is bogus? What if you have 41 with one hash and 39 with another - are you sure the 39 are wrong and the 41 correct? Clearly, it is likely that one hash will predominate, but you'll have to worry about those pesky boundary conditions.
You can also do fancier things, such as:
md5sum $files | sort -k 1,2 > sorted.md5
sed 's/ .*//' sorted.md5 | uniq -c | sed 's/^ *\([0-9][0-9]*\) \(.*\)/\2 \1/' > counted.md5
join -j 1 -o 1.1,2.2,1.2 sorted.md5 counted.md5
This gives you an output consisting of the MD5 checksum, repetition count, and file name. The first sed script could be replaced by awk '{print $1}' if you prefer. The second could be replaced by awk '{printf "%s %s\n", $2, $1}', which is probably clearer (and shorter). The reason for that futzing around is to get rid of the leading spaces in the output of uniq -c, which confuse join.
md5sum $files | sort -k 1,2 > sorted.md5
awk '{print $1}' sorted.md5 | uniq -c | awk '{printf "%s %s\n", $2, $1}' > counted.md5
join -j 1 -o 1.1,2.2,1.2 sorted.md5 counted.md5
I created some files x1.h, x2.h and x3.h by copying dbatools.h, and set files=$(ls *.h). The output was:
0763af91756ef24f3d8f61131eb8f8f2 1 dblbac.h
10215826449a3e0f967a4c436923cffa 1 dbatool.h
37f48869409c2b0554d83bd86034c9bf 4 dbatools.h
37f48869409c2b0554d83bd86034c9bf 4 x1.h
37f48869409c2b0554d83bd86034c9bf 4 x2.h
37f48869409c2b0554d83bd86034c9bf 4 x3.h
5a48695c6b8673373d30f779ccd3a3c2 1 dbxglob.h
7b22f7e2373422864841ae880aad056d 1 dbstringlist.h
a5b8b19715f99c7998c4519cd67f0230 1 dbimglob.h
f9ef785a2340c7903b8e1ae4386df211 1 dbmach11.h
This can be further processed as necessary (for example, with sort -k2,3nr to get the counts in decreasing order, so the deviant files appear last). You have the names of the duplicate files grouped together, along with a count telling you how many copies there are of each. What you do next is up to you.
A real production script would use temporary file names instead of hard-coded names, of course, and would clean up after itself.
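For example, to print just the names of the deviant files, you could post-process the join output above (checksum, count, file name); this sketch assumes the version with the highest count is the correct one:
join -j 1 -o 1.1,2.2,1.2 sorted.md5 counted.md5 | sort -k2,2nr | awk 'NR == 1 { max = $2 } $2 < max { print $3 }'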
md5sum FILES > MD5SUMS.md5
cut -c1-32 < MD5SUMS.md5 | sort | uniq -c | sort -n
will return something like this:
1 485fd876eef8e941fcd6fc19643e5e59
1 585fd876eef8e941fcd6fc19643e5e59
5 385fd876eef8e941fcd6fc19643e5e59
Reading this: 5 files have the same checksum, and two others have "individual" checksums. I assume that the majority is right, so an additional
| tail -1 | cut -c 9-
returns the checksum of the last line. Now filter everything else (and put the parts together):
md5sum FILES > MD5SUMS.md5
grep -v "$(cut -c1-32 < MD5SUMS.md5 | sort | uniq -c | sort -n | tail -1 | cut -c 9-)" MD5SUMS.md5 | cut -c35-
This will print the filenames of the non-majority files.
I can do this in Python, but I was wondering if I could do it in Linux.
I have a file like this:
name1 text text 123432re text
name2 text text 12344qp text
name3 text text 134234ts text
I want to find all the different types of values in the 3rd column for a particular username, let's say name1.
grep name1 filename gives me all the lines, but there must be some way to just list all the different types of values (I don't want to display duplicate values for the same username).
grep name1 filename | cut -d ' ' -f 4 | sort -u
This will find all lines that have name1, then get just the fourth column of data and show only unique values.
I tried using cat.
The file contains the following (here the file is foo.sh; you can use any file name):
$cat foo.sh
tar
world
class
zip
zip
zip
python
jin
jin
doo
doo
uniq will print each word only once:
$ cat foo.sh | sort | uniq
class
doo
jin
python
tar
world
zip
uniq -u will print only the words that appear exactly once in the file:
$ cat foo.sh | sort | uniq -u
class
python
tar
world
uniq -d will print only the duplicated words, each of them once:
$ cat foo.sh | sort | uniq -d
doo
jin
zip
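And, tying back to the counting theme, sort piped into uniq -c will print each word once along with the number of times it appears:
$ cat foo.sh | sort | uniq -c
   1 class
   2 doo
   2 jin
   1 python
   1 tar
   1 world
   3 zip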
You can let sort look only at the 4th field, and then ask only for records with unique keys:
grep name1 | sort -k4 -u
As an all-in-one awk solution:
awk '$1 == "name1" && ! seen[$1" "$4]++ {print $4}' filename
IMHO Michał Šrajer has the best answer, but a filename is needed after grep name1.
And I've got this fancy solution using an associative array:
user=name1
# read the matching lines into an array, splitting on newlines only
IFSOLD=$IFS; IFS=$'\n'; test=( $(grep $user test) ); IFS=$IFSOLD
declare -A index
for item in "${test[@]}"; {
    # split each line into words and index the whole line by its 4th field
    sub=( $item )
    name=${sub[3]}
    index[$name]=$item
}
for item in "${index[@]}"; { echo $item; }
In my opinion, you need to select the field from which you need the unique values. I was trying to retrieve unique source IPs from an iptables log.
cat /var/log/iptables.log | grep "May 5" | awk '{print $11}' | sort -u
Here is the output of the above command:
SRC=192.168.10.225
SRC=192.168.10.29
SRC=192.168.20.125
SRC=192.168.20.147
SRC=192.168.20.155
SRC=192.168.20.183
SRC=192.168.20.194
So, the best idea is to select the field first and then filter down to the unique values.
The following command worked for me:
sudo cat AirtelFeb.txt | awk '{print $3}' | sort -u
Here it prints the 3rd column with unique values.
I think you meant fourth column.
You can try using: cat Filename.txt | awk '{print $4}' | sort | uniq