Grep in Unix + LIKE in SQL + GROUP BY in SQL

What I have in the file is
112}2014-03-02}ABC
112}2014-02-02}ABC
112}}ABC
112}2014-03-02}ABC
112}2013-03-02}ABC
112}2013-02-02}ABC
112}2013-03-02}ABC
I need to grep and see something like this as an output, and I'm looking for a one-liner (grep preferred):
2014-03-02 2
2014-02-02 1
1
2013-03-02 2
2013-02-02 1
I found this question, but it does not serve my purpose.
I also tried commands like these, but with no luck:
grep -ioh '2014' file.txt | sort | uniq -c
grep -ioh '2013' file.txt | sort | uniq -c

This awk command may do the job:
awk -F\} '{a[$2]++} END {for (i in a) printf "%-10s %s\n",i,a[i]}' file
1
2013-02-02 1
2014-02-02 1
2014-03-02 2
2013-03-02 2
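The for (i in a) loop visits keys in no particular order; if you want the dates sorted, just pipe the result through sort:
awk -F\} '{a[$2]++} END {for (i in a) printf "%-10s %s\n",i,a[i]}' file | sort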

Here is how you could do it with egrep:
egrep -o '[0-9]{4}-[0-9]{2}-[0-9]{2}' file.txt | sort | uniq -c
Output:
1 2013-02-02
2 2013-03-02
1 2014-02-02
2 2014-03-02
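Note that current GNU grep deprecates egrep in favor of grep -E, so the equivalent modern spelling would be:
grep -Eo '[0-9]{4}-[0-9]{2}-[0-9]{2}' file.txt | sort | uniq -c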
Edit
To also count lines without a date on them, you can approach it similarly to @Jotne's answer:
grep -o '}[^}]*}' file.txt | tr -d '}' | sort | uniq -c
Output:
1
1 2013-02-02
2 2013-03-02
1 2014-02-02
2 2014-03-02
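Since the date is always the second }-delimited field, a simpler sketch of the same idea uses cut (the dateless line shows up as a count with an empty field, as above):
cut -d'}' -f2 file.txt | sort | uniq -c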

Related

Linux command to retrieve unique words and count along with punctuation marks

tr -c '[:alnum:]' '[\n*]' < 4300-0.txt | sort | uniq -c | sort -nr | head
The command above retrieves unique words along with their counts. I'd like to retrieve punctuation marks along with the unique word counts.
How can I achieve this?
You could split your input with tee and extract punctuation and alphanumeric tokens separately.
echo "Helo, world!" |
{
tee >(tr -c '[:alnum:]' '\n' >&3) |
tr -c '[:punct:]' '\n'
} 3>&1 |
sed '/^$/d' |
sort | uniq -c | sort -nr | head
should output:
1 world
1 Helo
1 !
1 ,
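If the file-descriptor juggling feels too clever, a plain two-pass sketch over the same file gives the same counts (assuming the input is a regular file such as 4300-0.txt rather than a pipe):
{ tr -c '[:alnum:]' '\n' < 4300-0.txt; tr -c '[:punct:]' '\n' < 4300-0.txt; } | sed '/^$/d' | sort | uniq -c | sort -nr | head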
A short sed script also seems to work:
echo "Helo, world!
OK!" |
sed '
s/\([[:alnum:]]\+\)\([^[:alnum:]]\)/\1\n\2/g
s/\([[:punct:]]\+\)\([^[:punct:]]\)/\1\n\2/g
s/[^[:punct:][:alnum:]]/\n/g
' |
sed '/^$/d' |
sort | uniq -c | sort -nr | head
should output:
2 !
1 world
1 OK
1 Helo
1 ,
You can use [:punct:] to match the punctuation marks, and you can run:
tr -c '[:alnum:][:punct:]' '[\n*]' < 4300-0.txt | sort | uniq -c | sort -nr | head
It will print out the punctuation marks as well.
For example, if your txt file contains
aaa,
aaa
the output will be:
1 aaa
1 aaa,

How can I pipe multiple arguments into a unix command?

I couldn't find anything similar, hence this question.
Say I have two sets of commands, each of which produces an output.
The first set:
cat xyx.txt | awk '{.........}' | sed 's/.../' | cut -d....
The second set:
cat abc.txt | awk '{.........}' | cut -d ... | sed 's/...../'
I want the output of these as the two parameters to the join command. I know I can redirect these to files and then use join with the files as arguments.
Basically, can the whole thing be done in a single line? Something like:
[first set of commands] > join -1 1 -2 1 < [second set of commands]
If you are using bash, which is the default shell in many Linux distributions, the process substitution expression
join <([first set of commands]) <([second set of commands])
is valid.
So
join <(cat xyx.txt | awk '{.........}' | sed 's/.../' | cut -d....) <(cat abc.txt | awk '{.........}' | cut -d ... | sed 's/...../')
should do it.
Basic example
$ cat a
1 hello
2 bye
3 ciao
$ cat b
1 hello 123
2 bye 456
3 adios 789
$ cut -d' ' -f1,2 b | awk '{print $1, 2, $2}'
1 2 hello
2 2 bye
3 2 adios
$ join <(cut -d' ' -f1,2 b | awk '{print $1, 2, $2}') a
1 2 hello hello
2 2 bye bye
3 2 adios ciao
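If your shell lacks process substitution, a named pipe (mkfifo) is a rough equivalent, sketched here with the same sample files:
mkfifo p1
cut -d' ' -f1,2 b | awk '{print $1, 2, $2}' > p1 &
join p1 a
rm p1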

Trying to get the file with the most lines to print with its line count

So I've been goofing with this since last night, and I can get a lot of things to happen, just not what I want.
I need a command to find the file with the most lines in a directory and then print the name of the file and the number of lines it has.
I can get the entire directory's lines to print but can't seem to narrow the field so to speak.
Any help for a fool of a learner?
wc -l $1/* 2>/dev/null
| grep -v ' total$'
| sort -n -k1
| tail -1l
After some pro help in another question, this is where I got to, but it returns them all, and doesn't print their line counts.
The following awk command should do the job for you, and you can avoid all the redundant piped commands:
wc -l $1/* | awk '$2 != "total"{if($1>max){max=$1;fn=$2}} END{print max, fn}'
UPDATE: To skip the last line of wc's output, this might be a better awk command:
wc -l $1/* | awk '{arr[cnt++]=$0} END {for (i=0; i<length(arr)-1; i++)
{split(arr[i], a, " "); if(a[1]>max) {max=a[1]; fn=a[2]}} print max, fn}'
You can try:
wc -l $1/* | grep -v total | sort -g | tail -1
Actually, to avoid the grep, which would also remove files whose names contain "total":
for f in $1/*; do wc -l $f; done | sort -g | tail -1
Or even better, as suggested in the comments (the first line of the reverse-sorted output is wc's grand total, so sed -n '2p' picks the file with the most lines):
wc -l $1/* | sort -rg | sed -n '2p'
You can even make it a function:
function get_biggest_file() {
    wc -l $* | sort -rg | sed -n '2p'
}
% ls -l
... 0 Jun 12 17:33 a
... 0 Jun 12 17:33 b
... 0 Jun 12 17:33 c
... 0 Jun 12 17:33 d
... 25 Jun 12 17:33 total
% get_biggest_file ./*
5 total
EDIT2: Using the function I gave, you can simply output what you need as follows:
get_biggest_file $1/* | awk '{print "The file \"" $2 "\" has the maximum number of lines: " $1}'
EDIT: If you write the function as you did in the question, you should add a line continuation character at the end of each line, as follows; otherwise your shell will think you're trying to issue four commands:
wc -l $1/* 2>/dev/null \
| grep -v ' total$' \
| sort -n -k1 \
| tail -1l
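For what it's worth, here's a hedged, more defensive sketch that copes with odd file names and skips wc's grand-total line (assuming GNU find and coreutils):
find "$1" -maxdepth 1 -type f -exec wc -l {} + | awk '$2 != "total"' | sort -n | tail -1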

Awk: Word frequency from one text file; how to output into myFile.txt?

Given a .txt file with space-separated words such as:
But where is Esope the holly Bastard
But where is
And the awk command:
cat /pathway/to/your/file.txt | tr ' ' '\n' | sort | uniq -c | awk '{print $2"#"$1}'
I get the following output in my console:
1 Bastard
1 Esope
1 holly
1 the
2 But
2 is
2 where
How do I get it printed into myFile.txt?
I actually have 300,000 lines and nearly 2 million words, so it's better to output the result to a file.
EDIT: Used answer (by @Sudo_O):
$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" myfile.txt | sort > myfileout.txt
Your pipeline isn't very efficient; you should do the whole thing in awk instead:
awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file > myfile
If you want the output in sorted order:
awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file | sort > myfile
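Note that a regex record separator like RS=" |\n" requires GNU awk; a portable sketch that iterates over the fields instead works in any POSIX awk:
awk '{for (i=1; i<=NF; i++) a[$i]++} END {for (k in a) print a[k], k}' file | sort > myfile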
The actual output given by your pipeline is:
$ tr ' ' '\n' < file | sort | uniq -c | awk '{print $2"#"$1}'
Bastard#1
But#2
Esope#1
holly#1
is#2
the#1
where#2
Note: using cat is useless here; we can just redirect the input with <. The awk script doesn't make much sense either: it just reverses the order of the words and their frequencies and separates them with a #. If we drop the awk script, the output is closer to the desired output (note the leading spaces, however, and that it's unsorted):
$ tr ' ' '\n' < file | sort | uniq -c
1 Bastard
2 But
1 Esope
1 holly
2 is
1 the
2 where
We could sort again and remove the leading spaces with sed:
$ tr ' ' '\n' < file | sort | uniq -c | sort | sed 's/^\s*//'
1 Bastard
1 Esope
1 holly
1 the
2 But
2 is
2 where
But as I mentioned at the start, let awk handle it:
$ awk '{a[$1]++}END{for(k in a)print a[k],k}' RS=" |\n" file | sort
1 Bastard
1 Esope
1 holly
1 the
2 But
2 is
2 where
Just redirect output to a file.
cat /pathway/to/your/file.txt | tr ' ' '\n' | sort | uniq -c | \
awk '{print $2"#"$1}' > myFile.txt
Just use shell redirection:
echo "test" > overwrite-file.txt
echo "test" >> append-to-file.txt
Tips
A useful command is tee, which allows you to redirect to a file and still see the output:
echo "test" | tee overwrite-file.txt
echo "test" | tee -a append-file.txt
Sorting and locale
I see you are working with Asian script; you need to be careful with the locale used by your system, as the resulting sort might not be what you expect:
* WARNING * The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.
And have a look at the output of:
locale
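For instance, a quick sketch of the difference:
printf 'a\nB\nA\nb\n' | sort              # locale-dependent; e.g. a A b B under en_US.UTF-8
printf 'a\nB\nA\nb\n' | LC_ALL=C sort     # always byte order: A B a b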

Grep for multiple patterns in a file

I'd like to count the number of XML nodes in my XML file (with grep or something similar).
....
<countryCode>GBR</countryCode>
<countryCode>USA</countryCode>
<countryCode>CAN</countryCode>
...
<countryCode>CAN</countryCode>
<someNode>USA</someNode>
<countryCode>CAN</countryCode>
<someNode>Otherone</someNode>
<countryCode>GBR</countryCode>
...
How do I get counts for individual countries, like CAN = 3, USA = 1, GBR = 2, without passing in the names of the countries? There might be some more countries.
Update:
There are other nodes besides countryCode.
My simple suggestion would be to use sort and uniq -c:
$ echo '<countryCode>GBR</countryCode>
<countryCode>USA</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>GBR</countryCode>' | sort | uniq -c
3 <countryCode>CAN</countryCode>
2 <countryCode>GBR</countryCode>
1 <countryCode>USA</countryCode>
Where you'd pipe in the output of your grep instead of an echo. A more robust solution would be to use XPath. If your XML file looks like
<countries>
<countryCode>GBR</countryCode>
<countryCode>USA</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>GBR</countryCode>
</countries>
Then you could use:
$ xpath -q -e '/countries/countryCode/text()' countries.xml | sort | uniq -c
3 CAN
2 GBR
1 USA
I say it's more robust because tools designed for parsing flat text are inherently flaky when dealing with XML. Depending on the context of the original XML file, a different XPath query that matches the elements anywhere might work better:
$ xpath -q -e '//countryCode/text()' countries.xml | sort | uniq -c
3 CAN
2 GBR
1 USA
grep can give a total count, but it doesn't count per pattern; for that you should use uniq -c:
$ uniq -c <(sort file)
1
1
3 <countryCode>CAN</countryCode>
2 <countryCode>GBR</countryCode>
1 <countryCode>USA</countryCode>
If you want to get rid of the empty lines and tags, add sed:
$ sed -e '/^[[:space:]]*$/d' -e 's/<.*>\([A-Z]*\)<.*>/\1/g' test | sort | uniq -c
3 CAN
2 GBR
1 USA
To delete lines that don't have a country code, add another command to sed:
$ sed -e '/countryCode/!d' -e '/^[[:space:]]*$/d' -e 's/<.*>\([A-Z]*\)<.*>/\1/g' test | sort | uniq -c
3 CAN
2 GBR
1 USA
Quick and dirty (based only on your example text):
awk -F'>|<' '{a[$3]++;}END{for(x in a)print x,a[x]}' file
test:
kent$ cat t.txt
<countryCode>GBR</countryCode>
<countryCode>USA</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>CAN</countryCode>
<countryCode>GBR</countryCode>
kent$ awk -F'>|<' '{a[$3]++;}END{for(x in a)print x,a[x]}' t.txt
USA 1
GBR 2
CAN 3
Another sed option, reading from standard input:
sed -n "s/<countryCode>\(.*\)<\/countryCode>/\1/p" | sort | uniq -c
cat dummy | sort | cut -c14-16 | sort | tail -6 | awk '{col[$1]++} END {for (i in col) print i, col[i]}'
Here dummy is your file name; replace the 6 in -6 with n-2 (where n is the number of lines in your data file).
Something like this maybe:
grep -e 'regex' file.xml | sort | uniq -c
Of course, you need to provide a regex that matches your needs.
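For instance, if your grep supports PCRE (GNU grep's -P option), a lookbehind can extract just the codes, assuming each tag stays on one line:
grep -oP '(?<=<countryCode>)[^<]+' file.xml | sort | uniq -c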
If your file is set up as you have shown us, awk can do it like this:
awk -F '</?countryCode>' '{ a[$2]++ } END { for (e in a) { printf("%s\t%i\n", e, a[e]) } }' INPUTFILE
If there is more than one <countryCode> tag on a line, you can still set up a pipe to put each on its own line, e.g.:
sed 's/<countryCode>/\n<countryCode>/g' INPUTFILE | awk ...
Note that if a <countryCode> element spans multiple lines, this does not work as expected.
Anyway, I'd recommend using XPath for this kind of task (Perl's XML::XPath module has a CLI utility for this).
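For example, reusing the xpath utility shown in an earlier answer (shipped with Perl's XML::XPath distribution):
xpath -q -e '//countryCode/text()' INPUTFILE | sort | uniq -c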
Quick and simple:
grep countryCode ./file.xml | sort | uniq -c
