Please how can I count the list of numbers in a file
awk '{for(i=1;i<=NF;i++){if($i>=0 && $i<=25){print $i}}}'
Using the command above I can display the range of numbers on the terminal but if there are so many it will be difficult to count them. Please how can I show the count of the numbers on the terminal for example
1-20,
2-22,
3-23,
4-24,
etc
I know I can use wc but I don't know how to infuse it into the command above
awk '
{ for(i=1;i<=NF;i++) if (0<=$i && $i<=25) cnts[$i]++ }
END { for (n in cnts) print n, cnts[n] }
' file
Pipe the output to sort -n and uniq -c
awk '{for(i=1;i<=NF;i++){if($i>=0 && $i<=25){print $i}}}' filename | sort -n | uniq -c
You need to sort first because uniq requires all the same elements to be consecutive.
While I'm personally an awk fan, you might be glad to learn about grep -o functionality. I'm using grep -o to match all numbers in the file, and then awk can be used to pick all the numbers between 0 and 25 (inclusive). Last, we can use sort and uniq to count the results.
grep -o "[0-9][0-9]*" file | awk ' $1 >= 0 && $1 <= 25 ' | sort -n | uniq -c
Of course, you could do the counting in awk with an associative array as Ed Morton suggests:
egrep -o "\d+" file | awk ' $1 >= 0 && $1 <= 25 ' | awk '{cnt[$1]++} END { for (i in cnt) printf("%s-%s\n", i,cnt[i] ) } '
I modified Ed's code (typically not a good idea - I've been reading his code for years now) to show a modular approach - an awk script for filtering numbers in the range of 0 and 25 and another awk script for counting a list (of anything).
I also provided another subtle difference from my first script with egrep instead of grep.
To be honest, the second awk script generates some unexpected output, but I wanted to share an example of a more general approach. EDIT: I applied Ed's suggestion to correct the unexpected output - it's fine now.
Related
I have a list file, which has id and number and am trying to get those lines from a master file which do not have those ids.
List file
nw_66 17296
nw_67 21414
nw_68 21372
nw_69 27387
nw_70 15830
nw_71 32348
nw_72 21925
nw_73 20363
master file
nw_1 5896
nw_2 52814
nw_3 14537
nw_4 87323
nw_5 56466
......
......
nw_n xxxxx
so far am trying this but not working as expected.
for i in $(awk '{print $1}' list.txt); do grep -v -w $i master.txt; done;
Kindly help
Give this awk one-liner a try:
awk 'NR==FNR{a[$1]=1;next}!a[$1]' list master
Maybe this helps:
awk 'NR == FNR {id[$1]=1;next}
{
if (id[$1] == "") {
print $0
}
}' listfile masterfile
We accept 2 files as input above, first one is listfile, second is masterfile.
NR == FNR would be true while awk is going through listfile. In the associative array id[], all ids in listfile are made a key with value as 1.
When awk goes through masterfile, it only prints a line if $1 i.e. the id is not a key in array ids.
The OP attempted the following line:
for i in $(awk '{print $1}' list.txt); do grep -v -w $i master.txt; done;
This line will not work as for every entry $i, you print all entries in master.txt tat are not equivalent to "$i". As a consequence, you will end up with multiple copies of master.txt, each missing a single line.
Example:
$ for i in 1 2; do grep -v -w "$i" <(seq 1 3); done
2 \ copy of seq 1 3 without entry 1
3 /
1 \ copy of seq 1 3 without entry 2
3 /
Furthermore, the attempt reads the file master.txt multiple times. This is very inefficient.
The unix tool grep allows one the check multiple expressions stored in a file in a single go. This is done using the -f flag. Normally this looks like:
$ grep -f list.txt master.txt
The OP can use this now in the following way:
$ grep -vwf <(awk '{print $1}' list.txt) master.txt
But this would do matches over the full line.
The awk solution presented by Kent is more flexible and allows the OP to define a more tuned match:
awk 'NR==FNR{a[$1]=1;next}!a[$1]' list master
Here the OP clearly states, I want to match column 1 of list with column 1 of master and I don't care about spaces or whatever is in column 2. The grep solution could still match entries in column 2.
I would like to know the count of unique values in column using linux commands. The column has values like below (data is edited from previous ones). I need to ignore .M, .Q and .A at the end and just count the unique number of plants
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL.M"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL.Q"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56855-WND-ALL.A"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL.Q"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL.A"
"series_id":"ELEC.PLANT.CONS_EG_BTU.56841-WND-WT.Q"
"series_id":"ELEC.CONS_TOT.COW-GA-2.M"
"series_id":"ELEC.CONS_TOT.COW-GA-94.M"
I've tried this code but I'm not able to avoid those suffix
cat ELEC.txt | grep 'series_id' | cut -d, -f1 | wc -l
For above sample, expected count should be 6 but I get 8
This should do the job:
grep -Po "ELEC.PLANT.*" FILE | cut -d. -f -4 | sort | uniq -c
You first grep for the "ELEC.PLANT." part
remove the .Q,A,M
remove duplicates and count using sort | uniq -c
EDIT:
for the new data it should be only necessary to do the following:
grep -Po "ELEC.*" FILE | cut -d. -f -4 | sort | uniq -c
When you have to do some counting, you can easily do it with awk. Awk is an extremely versatile tool and I strongly recommend you to have a look at it. Maybe start with Awk one-liners explained.
Having that said, you can easily do some conditioned counting here:
What you want, is to count all unique lines which have series_id in it.
awk '/series_id/ && (! $0 in a) { c++; a[$0] } END {print c}'
This essentially states: if my line contains "series_id" and I did not store the line in my array a, then it means I did not encounter my line yet and increase the counter c with 1. At the END of the program, I print the count c.
Now you want to clean things up a bit. Your lines of interest essentially look like
"something":"something else"
So we are interested in something else which is in the 4th field if " is a field separator, and we are only interested in that if something is series_id located in field 2.
awk -F'"' '($2=="series_id") && (! $4 in a ) { c++; a[$4] } END {print c}'
Finally, you don't care about the last letter of the fourth field, so we need to make a small substitution:
awk -F'"' '($2=="series_id") { str=$4; gsub(/.$/,"",str); if (! str in a) {c++; a[str] } } END {print c}'
You could also rewrite this differently as:
awk -F'"' '($2 != "series_id" ) { next }
{ str=$4; gsub(/.$/,"",str) }
( str in a ) { next }
{ c++; a[str] }
END { print c }'
My standard way to count unique values is making sure I have the list of values (using grep and cut in your case), and add the following commands behind a pipe:
| sort -n | uniq -c
The sort does the sorting, based on number sorting, while the uniq gets the unique entries (the -c stands for "count").
Do this : cat ELEC.txt | grep 'series_id' | cut -f1-4 -d. | uniq | wc -l
-f1-4 will remove the the fourth . from each line
Here is a possible solution using awk:
awk 'BEGIN{FS="[:.\"]+"} /^"series_id":/{print $6}' \
ELEC.txt |sort -n |uniq -c
The ouput for the sample you posted will be something like this:
1 56841-WND-WT
2 56855-ALL-ALL
1 56855-WND-ALL
2 56868-LFG-ALL
If you need the entire string, you can print the other fields as well:
awk 'BEGIN{FS="[:.\"]+"; OFS="."} /^"series_id":/{print $3,$4,$5,$6}' \
ELEC.txt |sort -n | uniq -c
And the output will be something like this:
1 ELEC.PLANT.CONS_EG_BTU.56841-WND-WT
2 ELEC.PLANT.CONS_EG_BTU.56855-ALL-ALL
1 ELEC.PLANT.CONS_EG_BTU.56855-WND-ALL
2 ELEC.PLANT.CONS_EG_BTU.56868-LFG-ALL
I have written a shell script to get any column.
The script is:
#!/bin/sh
awk '{print $c}' c=${1:-1}
So that i can call it as
ls -l | column 2
But how do i implement it for multiple columns?
Say, it i want something like :
ls -l | column 2 3
In this case, I wouldn't use awk at all:
columns() { tr -s '[:blank:]' ' ' | cut -d ' ' -f "$#"; }
This uses tr to squeeze all sequences of whitespace to a single space, then cut to extract the fields you're interested in.
ls -l | columns 1,5,9-
Note, you shouldn't parse the output of ls.
awk -v c="3,5" '
BEGIN{ split(c,a,/[^[:digit:]]+/) }
{ for(i=1;i in a;i++) printf "%s%s", (i==1?"":OFS), $(a[i]); print "" }
'
Use any non-digit(s) you like as the column number separator (e.g. comma or space).
ls -l | nawk -v r="3,5" 'BEGIN{split(r,a,",")}{for(i in a)printf $a[i]" ";print "\n"}'
Now you can simply change your r variable in shell script and pass it on.
or you can configure it in your shell script and use r=$your_list_var
What ever fields numbers are present in $your_list_var will be printed by awk command.
The example above print 3rd and 5th fields of ls -l output.
I am trying to make a a simple script of finding the largest word and its number/length in a text file using bash. I know when I use awk its simple and straight forward but I want to try and use this method...lets say I know if a=wmememememe and if I want to find the length I can use echo {#a} its word I would echo ${a}. But I want to apply it on this below
for i in `cat so.txt` do
Where so.txt contains words, I hope it makes sense.
bash one liner.
sed 's/ /\n/g' YOUR_FILENAME | sort | uniq | awk '{print length, $0}' | sort -nr | head -n 1
read file and split the words (via sed)
remove duplicates (via sort | uniq)
prefix each word with it's length (awk)
sort the list by the word length
print the single word with greatest length.
yes this will be slower than some of the above solutions, but it also doesn't require remembering the semantics of bash for loops.
Normally, you'd want to use a while read loop instead of for i in $(cat), but since you want all the words to be split, in this case it would work out OK.
#!/bin/bash
longest=0
for word in $(<so.txt)
do
len=${#word}
if (( len > longest ))
then
longest=$len
longword=$word
fi
done
printf 'The longest word is %s and its length is %d.\n' "$longword" "$longest"
Another solution:
for item in $(cat "$infile"); do
length[${#item}]=$item # use word length as index
done
maxword=${length[#]: -1} # select last array element
printf "longest word '%s', length %d" ${maxword} ${#maxword}
longest=""
for word in $(cat so.txt); do
if [ ${#word} -gt ${#longest} ]; then
longest=$word
fi
done
echo $longest
awk script:
#!/usr/bin/awk -f
# Initialize two variables
BEGIN {
maxlength=0;
maxword=0
}
# Loop through each word on the line
{
for(i=1;i<=NF;i++)
# Assign the maxlength variable if length of word found is greater. Also, assign
# the word to maxword variable.
if (length($i)>maxlength)
{
maxlength=length($i);
maxword=$i;
}
}
# Print out the maxword and the maxlength
END {
print maxword,maxlength;
}
Textfile:
[jaypal:~/Temp] cat textfile
AWK utility is a data_extraction and reporting tool that uses a data-driven scripting language
consisting of a set of actions to be taken against textual data (either in files or data streams)
for the purpose of producing formatted reports.
The language used by awk extensively uses the string datatype,
associative arrays (that is, arrays indexed by key strings), and regular expressions.
Test:
[jaypal:~/Temp] ./script.awk textfile
data_extraction 15
Relatively speedy bash function using no external utils:
# Usage: longcount < textfile
longcount ()
{
declare -a c;
while read x; do
c[${#x}]="$x";
done;
echo ${#c[#]} "${c[${#c[#]}]}"
}
Example:
longcount < /usr/share/dict/words
Output:
23 electroencephalograph's
'Modified POSIX shell version of jimis' xargs-based
answer; still very slow, takes two or three minutes:
tr "'" '_' < /usr/share/dict/words |
xargs -P$(nproc) -n1 -i sh -c 'set -- {} ; echo ${#1} "$1"' |
sort -n | tail | tr '_' "'"
Note the leading and trailing tr bit to get around GNU xargs
difficulty with single quotes.
for i in $(cat so.txt); do echo ${#i}; done | paste - so.txt | sort -n | tail -1
Slow because of the gazillion of forks, but pure shell, does not require awk or special bash features:
$ cat /usr/share/dict/words | \
xargs -n1 -I '{}' -d '\n' sh -c 'echo `echo -n "{}" | wc -c` "{}"' | \
sort -n | tail
23 Pseudolamellibranchiata
23 pseudolamellibranchiate
23 scientificogeographical
23 thymolsulphonephthalein
23 transubstantiationalist
24 formaldehydesulphoxylate
24 pathologicopsychological
24 scientificophilosophical
24 tetraiodophenolphthalein
24 thyroparathyroidectomize
You can easily parallelize, e.g. to 4 CPUs by providing -P4 to xargs.
EDIT: modified to work with the single quotes that some dictionaries have. Now it requires GNU xargs because of -d argument.
EDIT2: for the fun of it, here is another version that handles all kinds of special characters, but requires the -0 option to xargs. I also added -P4 to compute on 4 cores:
cat /usr/share/dict/words | tr '\n' '\0' | \
xargs -0 -I {} -n1 -P4 sh -c 'echo ${#1} "$1"' wordcount {} | \
sort -n | tail
I am able to find the number of times a word occurs in a text file, like in Linux we can use:
cat filename|grep -c tom
My question is, how do I find the count of multiple words like "tom" and "joe" in a text file.
Since you have a couple names, regular expressions is the way to go on this one. At first I thought it was as simple as just a grep count on the regular expression of joe or tom, but fount that this did not account for the scenario where tom and joe are on the same line (or tom and tom for that matter).
test.txt:
tom is really really cool! joe for the win!
tom is actually lame.
$ grep -c '\<\(tom\|joe\)\>' test.txt
2
As you can see from the test.txt file, 2 is the wrong answer, so we needed to account for names being on the same line.
I then used grep -o to show only the part of a matching line that matches the pattern where it gave the correct pattern matches of tom or joe in the file. I then piped the results into number of lines into wc for the line count.
$ grep -o '\(joe\|tom\)' test.txt|wc -l
3
3...the correct answer! Hope this helps
Ok, so first split the file into words, then sort and uniq:
tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c
You use uniq:
sort filename | uniq -c
Use awk:
{for (i=1;i<=NF;i++)
count[$i]++
}
END {
for (i in count)
print count[i], i
}
This will produce a complete word frequency count for the input.
Pipe tho output to grep to get the desired fields
awk -f w.awk input | grep -E 'tom|joe'
BTW, you do not need cat in your example, most programs that acts as filters can take the filename as an parameter; hence it's better to use
grep -c tom filename
if not, there is a strong possibility that people will start throwing Useless Use of Cat Award at you ;-)
The sample you gave does not search for words "tom". It will count "atom" and "bottom" and many more.
Grep searches for regular expressions. Regular expression that matches word "tom" or "joe" is
\<\(tom\|joe\)\>
You could do regexp,
cat filename |tr ' ' '\n' |grep -c -e "\(joe\|tom\)"
Here is one:
cat txt | tr -s '[:punct:][:space:][:blank:]'| tr '[:punct:][:space:][:blank:]' '\n\n\n' | tr -s '\n' | sort | uniq -c
UPDATE
A shell script solution:
#!/bin/bash
file_name="$2"
string="$1"
if [ $# -ne 2 ]
then
echo "Usage: $0 <pattern to search> <file_name>"
exit 1
fi
if [ ! -f "$file_name" ]
then
echo "file \"$file_name\" does not exist, or is not a regular file"
exit 2
fi
line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurance=0
# line_no_list contains loc k the line number loc k+1 the number
# of times the string occur at that line
while read line
do
flag=0
while [[ "$line" == *$string* ]]
do
flag=1
line_no_list[line_no_indx]=$curr_line_indx
line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
total_occurance=$((total_occurance+1))
# remove the pattern "$string" with a null" and recheck
line=${line/"$string"/}
done
# if we have entered the while loop then increment the
# line index to access the next array pos in the next
# iteration
if (( flag == 1 ))
then
line_no_indx=$((line_no_indx+2))
fi
curr_line_indx=$((curr_line_indx+1))
done < "$file_name"
echo -e "\nThe string \"$string\" occurs \"$total_occurance\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurence # : Line Number : Nos of Occurance in this line]: "
for ((i=0; i<line_no_indx; i=i+2))
do
echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done
echo
I completely forgot about grep -f:
cat filename | grep -fc names
AWK solution:
Assuming the names are in a file called names:
cat filename | awk 'NR==FNR {h[NR] = $1;ct[i] = 0; cnt=NR} NR !=FNR {for(i=1;i<=cnt;++i) if(match($0,h[i])!=0) ++ct[i] } END {for(i in h) print h[i], ct[i]}' names -
Note that your original grep doesn't search for words. e.g.
$ echo tomorrow | grep -c tom
1
You need grep -w
gawk -vRS='[^[:alpha:]]+' '{print}' | grep -c '^(tom|joe|bob|sue)$'
The gawk program sets the record separator to anything non-alphabetic, so every word will end up on a separate line. Then grep counts lines that match one of the words you want exactly.
We use gawk because the POSIX awk doesn't allow regex record separator.
For brevity, you can replace '{print}' with 1 - either way, it's an Awk program that simply prints out all input records ("is 1 true? it is? then do the default action, which is {print}.")
To find all hits in all lines
echo "tom is really really cool! joe for the win!
tom is actually lame." | akw '{i+=gsub(/tom|joe/,"")} END {print i}'
3
This will count "tomtom" as 2 hits.