Finding the longest word in a text file - linux

I am trying to make a a simple script of finding the largest word and its number/length in a text file using bash. I know when I use awk its simple and straight forward but I want to try and use this method...lets say I know if a=wmememememe and if I want to find the length I can use echo {#a} its word I would echo ${a}. But I want to apply it on this below
for i in `cat so.txt` do
Where so.txt contains words, I hope it makes sense.

bash one liner.
sed 's/ /\n/g' YOUR_FILENAME | sort | uniq | awk '{print length, $0}' | sort -nr | head -n 1
read file and split the words (via sed)
remove duplicates (via sort | uniq)
prefix each word with it's length (awk)
sort the list by the word length
print the single word with greatest length.
yes this will be slower than some of the above solutions, but it also doesn't require remembering the semantics of bash for loops.

Normally, you'd want to use a while read loop instead of for i in $(cat), but since you want all the words to be split, in this case it would work out OK.
#!/bin/bash
longest=0
for word in $(<so.txt)
do
len=${#word}
if (( len > longest ))
then
longest=$len
longword=$word
fi
done
printf 'The longest word is %s and its length is %d.\n' "$longword" "$longest"

Another solution:
for item in $(cat "$infile"); do
length[${#item}]=$item # use word length as index
done
maxword=${length[#]: -1} # select last array element
printf "longest word '%s', length %d" ${maxword} ${#maxword}

longest=""
for word in $(cat so.txt); do
if [ ${#word} -gt ${#longest} ]; then
longest=$word
fi
done
echo $longest

awk script:
#!/usr/bin/awk -f
# Initialize two variables
BEGIN {
maxlength=0;
maxword=0
}
# Loop through each word on the line
{
for(i=1;i<=NF;i++)
# Assign the maxlength variable if length of word found is greater. Also, assign
# the word to maxword variable.
if (length($i)>maxlength)
{
maxlength=length($i);
maxword=$i;
}
}
# Print out the maxword and the maxlength
END {
print maxword,maxlength;
}
Textfile:
[jaypal:~/Temp] cat textfile
AWK utility is a data_extraction and reporting tool that uses a data-driven scripting language
consisting of a set of actions to be taken against textual data (either in files or data streams)
for the purpose of producing formatted reports.
The language used by awk extensively uses the string datatype,
associative arrays (that is, arrays indexed by key strings), and regular expressions.
Test:
[jaypal:~/Temp] ./script.awk textfile
data_extraction 15

Relatively speedy bash function using no external utils:
# Usage: longcount < textfile
longcount ()
{
declare -a c;
while read x; do
c[${#x}]="$x";
done;
echo ${#c[#]} "${c[${#c[#]}]}"
}
Example:
longcount < /usr/share/dict/words
Output:
23 electroencephalograph's
'Modified POSIX shell version of jimis' xargs-based
answer; still very slow, takes two or three minutes:
tr "'" '_' < /usr/share/dict/words |
xargs -P$(nproc) -n1 -i sh -c 'set -- {} ; echo ${#1} "$1"' |
sort -n | tail | tr '_' "'"
Note the leading and trailing tr bit to get around GNU xargs
difficulty with single quotes.

for i in $(cat so.txt); do echo ${#i}; done | paste - so.txt | sort -n | tail -1

Slow because of the gazillion of forks, but pure shell, does not require awk or special bash features:
$ cat /usr/share/dict/words | \
xargs -n1 -I '{}' -d '\n' sh -c 'echo `echo -n "{}" | wc -c` "{}"' | \
sort -n | tail
23 Pseudolamellibranchiata
23 pseudolamellibranchiate
23 scientificogeographical
23 thymolsulphonephthalein
23 transubstantiationalist
24 formaldehydesulphoxylate
24 pathologicopsychological
24 scientificophilosophical
24 tetraiodophenolphthalein
24 thyroparathyroidectomize
You can easily parallelize, e.g. to 4 CPUs by providing -P4 to xargs.
EDIT: modified to work with the single quotes that some dictionaries have. Now it requires GNU xargs because of -d argument.
EDIT2: for the fun of it, here is another version that handles all kinds of special characters, but requires the -0 option to xargs. I also added -P4 to compute on 4 cores:
cat /usr/share/dict/words | tr '\n' '\0' | \
xargs -0 -I {} -n1 -P4 sh -c 'echo ${#1} "$1"' wordcount {} | \
sort -n | tail

Related

How to find 10 most frequent words in the file in Unix/Linux

How to find 10 most frequent words in the file in Unix/Linux?
I tried using this command in Unix:
sort file.txt | uniq -c | sort -nr | head -10
However I am not sure if it's correct and whether it is showing me 10 most frequent words in the large file.
I have a shell demo to deal with your problem ,even you have a file with more than one Word in one line
wordcount.sh
#!/bin/bash
# filename: wordcount.sh
# usage: word count
# handle position arguments
if [ $# -ne 1 ]
then
echo "Usage: $0 filename"
exit -1
fi
# realize word count
printf "%-14s%s\n" "Word" "Count"
cat $1 | tr 'A-Z' 'a-z' | \
egrep -o "\b[[:alpha:]]+\b" | \
awk '{ count[$0]++ }
END{
for(ind in count)
{ printf("%-14s%d\n",ind,count[ind]); }
}' | sort -k2 -n -r | head -n 10
just run ./wordcount.sh filename.txt
explain
Use the tr command to convert all uppercase letters to lowercase letters, then use the egrep command to grab all the words in the text and output them item by item. Finally, use the awk command and the associative array to implement the word count function, and decrement the output according to the number of occurrences. .

Get file word count using bash script WITHOUT wc command?

Is there an efficient way to use a bash script that will count the number of words in a specified file without utilizing the wc command?
Preferably, is it possible to do this with a simple loop?
I am having trouble coming up with a way to do this and everywhere I've already checked on Google uses the wc command.
Thanks!
This fuzzy code relies on tr and dd, and counts squeezed whitespace, which isn't always the same as the number of words, but it's not far off:
word_count() { tr -s '\n[:space:]' ' ' | tr -cd '[:space:]' |
dd of=/dev/null 2>&1 | tail -1 | (read a b ; echo $a ) ; }
Consider the number of words in man bash, followed by actual wc results:
man bash | word_count
man bash | wc --words
Output:
46139
46139
In this instance they're the same. But word_count is susceptible to off-by one errors if:
There are spaces but no alphanumerics:
printf ' \n' | word_count
Output:
1
There's a word, but no white space:
printf 'foo' | word_count
Output:
0

Linux usernames /etc/passwd listing

I want to print the longest and shortest username found in /etc/passwd. If I run the code below it works fine for the shortest (head -1), but doesn't run for (sort -n |tail -1 | awk '{print $2}). Can anyone help me figure out what's wrong?
#!/bin/bash
grep -Eo '^([^:]+)' /etc/passwd |
while read NAME
do
echo ${#NAME} ${NAME}
done |
sort -n |head -1 | awk '{print $2}'
sort -n |tail -1 | awk '{print $2}'
Here the issue is:
Piping finishes with the first sort -n |head -1 | awk '{print $2}' command. So, input to first command is provided through piping and output is obtained.
For the second command, no input is given. So, it waits for the input from STDIN which is the keyboard and you can feed the input through keyboard and press ctrl+D to obtain output.
Please run the code like below to get desired output:
#!/bin/bash
grep -Eo '^([^:]+)' /etc/passwd |
while read NAME
do
echo ${#NAME} ${NAME}
done |
sort -n |head -1 | awk '{print $2}'
grep -Eo '^([^:]+)' /etc/passwd |
while read NAME
do
echo ${#NAME} ${NAME}
done |
sort -n |tail -1 | awk '{print $2}
'
All you need is:
$ awk -F: '
NR==1 { min=max=$1 }
length($1) > length(max) { max=$1 }
length($1) < length(min) { min=$1 }
END { print min ORS max }
' /etc/passwd
No explicit loops or pipelines or multiple commands required.
The problem is that you only have two pipelines, when you really need one. So you have grep | while read do ... done | sort | head | awk and sort | tail | awk: the first sort has an input (i.e., the while loop) - the second sort doesn't. So the script is hanging because your second sort doesn't have an input: or rather it does, but it's STDIN.
There's various ways to resolve:
save the output of the while loop to a temporary file and use that as an input to both sort commands
repeat your while loop
use awk to do both the head and tail
The first two involve iterating over the password file twice, which may be okay - depends what you're ultimately trying to do. But using a small awk script, this can give you both the first and last line by way of the BEGIN and END blocks.
While you already have good answers, you can also use POSIX shell to accomplish your goal without any pipe at all using the parameter expansion and string length provided by the shell itself (see: POSIX shell specifiction). For example you could do the following:
#!/bin/sh
sl=32;ll=0;sn=;ln=; ## short len, long len, short name, long name
while read -r line; do ## read each line
u=${line%%:*} ## get user
len=${#u} ## get length
[ "$len" -lt "$sl" ] && { sl="$len"; sn="$u"; } ## if shorter, save len, name
[ "$len" -gt "$ll" ] && { ll="$len"; ln="$u"; } ## if longer, save len, name
done </etc/passwd
printf "shortest (%2d): %s\nlongest (%2d): %s\n" $sl "$sn" $ll "$ln"
Example Use/Output
$ sh cketcpw.sh
shortest ( 2): at
longest (17): systemd-bus-proxy
Using either pipe/head/tail/awk or the shell itself is fine. It's good to have alternatives.
(note: if you have multiple users of the same length, this just picks the first, you can use a temp file if you want to save all names and use -le and -ge for the comparison.)
If you want both the head and the tail from the same input, you may want something like sed -e 1b -e '$!d' after you sort the data to get the top and bottom lines using sed.
So your script would be:
#!/bin/bash
grep -Eo '^([^:]+)' /etc/passwd |
while read NAME
do
echo ${#NAME} ${NAME}
done |
sort -n | sed -e 1b -e '$!d'
Alternatively, a shorter way:
cut -d":" -f1 /etc/passwd | awk '{ print length, $0 }' | sort -n | cut -d" " -f2- | sed -e 1b -e '$!d'

How to generate string elements that don't match a pattern?

If I have
days="1 2 3 4 5 6"
func() {
echo "lSecure1"
echo "lSecure"
echo "lSecure4"
echo "lSecure6"
echo "something else"
}
and do
func | egrep "lSecure[1-6]"
then I get
lSecure1
lSecure4
lSecure6
but what I would like is
lSecure2
lSecure3
lSecure5
which is all the days that doesn't have a lSecure string.
Question
My current idea is to use awk to split the $days and then loop over all combinations.
Is there a better way?
Note that grep -v inverts the sense of a plain grep and does not solve the problem as it does not generate the required strings.
I usually use the -f flag of grep for similar purposes. The <( ... ) code generates a file with all possibilities, grep only selects those not present in the func.
func | grep 'lSecure[1-6]' | grep -v -f- <( for i in $days ; do echo lSecure$i ; done )
Or, you may prefer it the other way round:
for i in $days ; do echo lSecure$i ; done | grep -vf <( func | grep 'lSecure[1-6]' )
F=$(func)
for f in $days; do
if ! echo $F | grep -q lSecure$f; then
echo lSecure$f
fi
done
An awk solution:
$ func | awk -v i="${days}" 'BEGIN{split(i,a," ")}{gsub(/lSecure/,"");
for(var in a)if(a[var] == $0){delete a[var];break}}
END{for(var in a) print "lSecure" a[var]}' | sort
We store it in an awk array a then while reading a line, get the last number, if it is present in array, then remove that from the array. So at the end, in the array, only those element which have not been found remains. Sort is just to present in a sorted manner :)
I am not sure exactly what you are trying to achieve, but you might consider using uniq -u which deletes repeated sequences. For example you can do this with it:
( echo "$days" | tr -s ' ' '\n'; func | grep -oP '(?<=lSecure)[1-6]' ) | sort | uniq -u
Output:
2
3
5

How do I find the count of multiple words in a text file?

I am able to find the number of times a word occurs in a text file, like in Linux we can use:
cat filename|grep -c tom
My question is, how do I find the count of multiple words like "tom" and "joe" in a text file.
Since you have a couple names, regular expressions is the way to go on this one. At first I thought it was as simple as just a grep count on the regular expression of joe or tom, but fount that this did not account for the scenario where tom and joe are on the same line (or tom and tom for that matter).
test.txt:
tom is really really cool! joe for the win!
tom is actually lame.
$ grep -c '\<\(tom\|joe\)\>' test.txt
2
As you can see from the test.txt file, 2 is the wrong answer, so we needed to account for names being on the same line.
I then used grep -o to show only the part of a matching line that matches the pattern where it gave the correct pattern matches of tom or joe in the file. I then piped the results into number of lines into wc for the line count.
$ grep -o '\(joe\|tom\)' test.txt|wc -l
3
3...the correct answer! Hope this helps
Ok, so first split the file into words, then sort and uniq:
tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c
You use uniq:
sort filename | uniq -c
Use awk:
{for (i=1;i<=NF;i++)
count[$i]++
}
END {
for (i in count)
print count[i], i
}
This will produce a complete word frequency count for the input.
Pipe tho output to grep to get the desired fields
awk -f w.awk input | grep -E 'tom|joe'
BTW, you do not need cat in your example, most programs that acts as filters can take the filename as an parameter; hence it's better to use
grep -c tom filename
if not, there is a strong possibility that people will start throwing Useless Use of Cat Award at you ;-)
The sample you gave does not search for words "tom". It will count "atom" and "bottom" and many more.
Grep searches for regular expressions. Regular expression that matches word "tom" or "joe" is
\<\(tom\|joe\)\>
You could do regexp,
cat filename |tr ' ' '\n' |grep -c -e "\(joe\|tom\)"
Here is one:
cat txt | tr -s '[:punct:][:space:][:blank:]'| tr '[:punct:][:space:][:blank:]' '\n\n\n' | tr -s '\n' | sort | uniq -c
UPDATE
A shell script solution:
#!/bin/bash
file_name="$2"
string="$1"
if [ $# -ne 2 ]
then
echo "Usage: $0 <pattern to search> <file_name>"
exit 1
fi
if [ ! -f "$file_name" ]
then
echo "file \"$file_name\" does not exist, or is not a regular file"
exit 2
fi
line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurance=0
# line_no_list contains loc k the line number loc k+1 the number
# of times the string occur at that line
while read line
do
flag=0
while [[ "$line" == *$string* ]]
do
flag=1
line_no_list[line_no_indx]=$curr_line_indx
line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
total_occurance=$((total_occurance+1))
# remove the pattern "$string" with a null" and recheck
line=${line/"$string"/}
done
# if we have entered the while loop then increment the
# line index to access the next array pos in the next
# iteration
if (( flag == 1 ))
then
line_no_indx=$((line_no_indx+2))
fi
curr_line_indx=$((curr_line_indx+1))
done < "$file_name"
echo -e "\nThe string \"$string\" occurs \"$total_occurance\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurence # : Line Number : Nos of Occurance in this line]: "
for ((i=0; i<line_no_indx; i=i+2))
do
echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done
echo
I completely forgot about grep -f:
cat filename | grep -fc names
AWK solution:
Assuming the names are in a file called names:
cat filename | awk 'NR==FNR {h[NR] = $1;ct[i] = 0; cnt=NR} NR !=FNR {for(i=1;i<=cnt;++i) if(match($0,h[i])!=0) ++ct[i] } END {for(i in h) print h[i], ct[i]}' names -
Note that your original grep doesn't search for words. e.g.
$ echo tomorrow | grep -c tom
1
You need grep -w
gawk -vRS='[^[:alpha:]]+' '{print}' | grep -c '^(tom|joe|bob|sue)$'
The gawk program sets the record separator to anything non-alphabetic, so every word will end up on a separate line. Then grep counts lines that match one of the words you want exactly.
We use gawk because the POSIX awk doesn't allow regex record separator.
For brevity, you can replace '{print}' with 1 - either way, it's an Awk program that simply prints out all input records ("is 1 true? it is? then do the default action, which is {print}.")
To find all hits in all lines
echo "tom is really really cool! joe for the win!
tom is actually lame." | akw '{i+=gsub(/tom|joe/,"")} END {print i}'
3
This will count "tomtom" as 2 hits.

Resources