I'm trying to count how many digits and how many letters are in a file in Bash.
I know that I can use wc -c file to count the number of characters, but how can I restrict it to letters only, and then to digits only?
Here's a way that avoids pipes entirely, using just tr and the shell's ${#variable} expansion, which gives the length of a variable:
$ cat file
123 sdf
231 (3)
huh? 564
242 wr =!
$ NUMBERS=$(tr -dc '[:digit:]' < file)
$ LETTERS=$(tr -dc '[:alpha:]' < file)
$ ALNUM=$(tr -dc '[:alnum:]' < file)
$ echo ${#NUMBERS} ${#LETTERS} ${#ALNUM}
13 8 21
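If you need this more than once, the same idea can be wrapped in a small helper; a minimal sketch (the function name count_class is made up for illustration, taking a character class name and a file):
count_class () {
    # count_class CLASS FILE - count characters of CLASS in FILE
    local kept=$(tr -dc "[:$1:]" < "$2")
    echo "${#kept}"
}
count_class digit file   # 13
count_class alpha file   # 8
count_class alnum file   # 21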
To count letters and digits you can combine grep with wc. Since grep -o prints each match on its own line, count matches with wc -l rather than wc -c (which would also count the newline after each match):
grep -o '[a-z]' myfile | wc -l
grep -o '[0-9]' myfile | wc -l
With a little tweaking you can modify it to count runs of digits, alphabetic words, or alphanumeric words, using extended regular expressions for the + repetition:
grep -oE '[a-z]+' myfile | wc -l
grep -oE '[0-9]+' myfile | wc -l
grep -oE '[[:alnum:]]+' myfile | wc -l
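For instance, run against the sample file from the first answer (assuming, as shown there, that all its letters are lowercase), this agrees with the tr-based counts:
$ grep -o '[a-z]' file | wc -l
8
$ grep -o '[0-9]' file | wc -l
13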
You can use sed to delete all characters that are not of the kind you are looking for, and then count the characters of the result.
# 1h;1!H places every line into the hold buffer; that way you can
# also replace newline characters
sed -n '1h;1!H;${;g;s/[^a-zA-Z]//g;p;}' myfile | wc -c
It's easy enough to just do numbers as well.
sed -n '1h;1!H;${;g;s/[^0-9]//g;p;}' myfile | wc -c
Or why not both.
sed -n '1h;1!H;${;g;s/[^0-9a-zA-Z]//g;p;}' myfile | wc -c
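One caveat: sed's p prints the result with a trailing newline, which wc -c includes, so each count comes out one higher than the number of matching characters. On the sample file from the first answer (8 letters), stripping the newline first gives the exact figure:
$ sed -n '1h;1!H;${;g;s/[^a-zA-Z]//g;p;}' file | wc -c
9
$ sed -n '1h;1!H;${;g;s/[^a-zA-Z]//g;p;}' file | tr -d '\n' | wc -c
8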
There are a number of ways to approach analyzing the line, word, and character frequency of a text file in bash. Using the POSIX character classes (e.g. [[:upper:]] and so on), you can drill down to the frequency of each character type in a text file. Below is a simple script that reads from stdin, provides the normal wc output as its first line of output, and then outputs the number of upper, lower, digit, punct, and whitespace characters.
#!/bin/bash

declare -i lines=0
declare -i words=0
declare -i chars=0
declare -i upper=0
declare -i lower=0
declare -i digit=0
declare -i punct=0

oifs="$IFS"
# Read each line with IFS=$'\n' to preserve internal whitespace
while IFS=$'\n' read -r line; do
    # parse the line into words with the original IFS
    IFS=$oifs
    set -- $line
    IFS=$'\n'
    # Add up lines, words, chars
    lines=$((lines + 1))
    words=$((words + $#))
    chars=$((chars + ${#line} + 1))   # +1 for the newline read strips
    # Classify every character on the line
    for ((i = 0; i < ${#line}; i++)); do
        [[ ${line:i:1} =~ [[:upper:]] ]] && ((upper++))
        [[ ${line:i:1} =~ [[:lower:]] ]] && ((lower++))
        [[ ${line:i:1} =~ [[:digit:]] ]] && ((digit++))
        [[ ${line:i:1} =~ [[:punct:]] ]] && ((punct++))
    done
done
echo " $lines $words $chars"
echo " upper: $upper, lower: $lower, digit: $digit, punct: $punct, \
whitespace: $((chars - upper - lower - digit - punct))"
Test Input
$ cat dat/captnjackn.txt
This is a tale
Of Captain Jack Sparrow
A Pirate So Brave
On the Seven Seas.
(along with 2357 other pirates)
Example Use/Output
$ bash wcount3.sh <dat/captnjackn.txt
5 21 108
upper: 12, lower: 68, digit: 4, punct: 3, whitespace: 21
You can customize the script to give you as little or as much detail as you like. Let me know if you have any questions.
You can use tr to keep only alphanumeric characters by combining the -c (complement) and -d (delete) flags. From there on, it's just a question of some piping:
$ cat myfile.txt | tr -cd '[:alnum:]' | wc -c
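Run against the sample file from the first answer, this gives the same 21 as the ${#ALNUM} approach above:
$ cat file | tr -cd '[:alnum:]' | wc -c
21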
This is the code:
#bash/bin
echo "Enter a sentence:"
read -e -a sentence
char="k"
echo "${sentence}" | awk -F"${char}" '{print NF-1}'
Problem:
It returns this error:
': not a valid identifier sentence -1
Sample Input
Enter a sentence:
thanks and okay
Sample Output
2
Question:
How can I fix this?
Just strip everything but $char, then count the results.
echo "Enter a sentence:"
read -e sentence
char="k"
filtered=${sentence//[^$char]/} # Delete anything *not* a $char
echo "${#filtered}" # Output the length of filtered
Using standard shell, you need a pair of external utilities instead of bash's parameter substitution operator:
echo "$sentence" | tr -cd "$char" | wc -c
#!/bin/bash
OLDIFS=$IFS; IFS=''   # disable word splitting so spaces don't cause issues
read -a sentence -p "Enter a sentence: "
IFS=$OLDIFS           # restore the old IFS value
char="k"
# the first grep explodes the string into one character per line,
# the second keeps only $char, and wc counts the occurrences
grep -o '.' <<< "$sentence" | grep "$char" | wc -l
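Since grep -o already prints one match per line, the two greps can be collapsed into one:
grep -o "$char" <<< "$sentence" | wc -l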
I am trying to count the number of characters in a variable. I used the shell commands below, but I am getting a "command not found" error on line 4.
#!/bin/bash
for i in one; do
n = $i | wc -c
echo $n
done
Can someone help me with this?
In bash you can just write ${#string}, which will return the length of the variable string, i.e. the number of characters in it.
Something like this:
#!/bin/bash
for i in one; do
n=$(echo $i | wc -c)
echo $n
done
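Note that wc -c also counts the newline echo appends, so the loop above prints 4 for "one", while ${#i} gives 3; a version using only the shell expansion:
#!/bin/bash
for i in one; do
    echo "${#i}"   # prints 3, the number of characters in $i
done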
Assignments in bash cannot have a space before the equals sign. In addition, you want to capture the output of the command you run and assign that to n, rather than piping $i (which is not a command) into wc.
Use the following instead:
#!/bin/bash
for i in one; do
    n=`echo "$i" | wc -c`
    echo $n
done
It can be as simple as that:
str="abcdef"; wc -c <<< "$str"
7
But mind you that the trailing newline counts as a character:
str="abcdef"; cat -A <<< "$str"
abcdef$
If you need to remove it:
str="abcdef"; tr -d '\n' <<< "$str" | wc -c
6
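Or skip the external commands entirely; the shell's ${#str} expansion never sees a trailing newline in the first place:
str="abcdef"; echo "${#str}"
6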
I have a string called $string
string='-d $DESTDIR/ERRORS/$BASEDIR ]] || $MKDIR -p word1 word22 word3.5'
Currently I am piping this through sed twice: once to replace the special characters with spaces, and again to squeeze runs of spaces/tabs into a single space.
echo $string | sed 's/[^a-zA-Z0-9]/ /g' | sed 's/\s\s*/ /g'
output='d DESTDIR ERRORS BASEDIR MKDIR p word1 word22 word3 5'
Although this works I am looking to gain some efficiency.
Can someone help me consolidate this to a single sed command?
EDIT
I should have noted that this needs to be POSIX compatible for HP/SOL/LIN
Use tr:
echo "$string" | tr -Cs 'a-zA-Z0-9' ' '
tr is a very powerful (and fast) tool for translating, deleting and squeezing characters.
This particular command translates every character from the Complement of the first set (a-zA-Z0-9) into characters from the second set; since the second set contains only a space, this translates all non-alphanumeric characters (including tabs) into spaces. It then squeezes all sequences of characters from the second character set into a single character; this replaces runs of spaces with single spaces.
Example:
$ string='-d $DESTDIR/ERRORS/$BASEDIR ]] || $MKDIR -p word1 word22 word3.5'
$ output=$(echo "$string" | tr -Cs 'a-zA-Z0-9' ' ')
$ echo $output
d DESTDIR ERRORS BASEDIR MKDIR p word1 word22 word3 5
Try this:
echo $string | sed -r 's/[^[:alnum:]]/ /g;s/ +/ /g'
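Note that -r (extended regex) is a GNU extension and may not be available on the HP-UX and Solaris systems the question mentions; a POSIX-safe spelling of the same idea uses basic-regex repetition in a single substitution:
echo "$string" | sed 's/[^[:alnum:]][^[:alnum:]]*/ /g'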
I am trying to make a simple script that finds the largest word and its length in a text file using bash. I know that with awk it's simple and straightforward, but I want to try this other method. Let's say I have a=wmememememe; if I want to find its length I can use echo ${#a}, and for the word itself, echo ${a}. But I want to apply that to a file, like this:
for i in `cat so.txt`; do
Where so.txt contains words. I hope that makes sense.
Bash one-liner:
sed 's/ /\n/g' YOUR_FILENAME | sort | uniq | awk '{print length, $0}' | sort -nr | head -n 1
read file and split the words (via sed)
remove duplicates (via sort | uniq)
prefix each word with its length (awk)
sort the list by the word length
print the single word with greatest length.
Yes, this will be slower than some of the other solutions, but it also doesn't require remembering the semantics of bash for loops.
Normally, you'd want to use a while read loop instead of for i in $(cat), but since you want all the words to be split, in this case it would work out OK.
#!/bin/bash
longest=0
for word in $(<so.txt)
do
    len=${#word}
    if (( len > longest ))
    then
        longest=$len
        longword=$word
    fi
done
printf 'The longest word is %s and its length is %d.\n' "$longword" "$longest"
Another solution:
for item in $(cat "$infile"); do
    length[${#item}]=$item   # use the word length as the array index
done
# bash keeps array indices in ascending order, so the last element
# belongs to the greatest length seen
maxword=${length[@]: -1}     # select the last array element
printf "longest word '%s', length %d\n" "$maxword" "${#maxword}"
longest=""
for word in $(cat so.txt); do
if [ ${#word} -gt ${#longest} ]; then
longest=$word
fi
done
echo $longest
awk script:
#!/usr/bin/awk -f

# Initialize two variables
BEGIN {
    maxlength = 0
    maxword = 0
}

# Loop through each word on the line
{
    for (i = 1; i <= NF; i++)
        # If this word is longer than the longest seen so far,
        # record its length and the word itself
        if (length($i) > maxlength) {
            maxlength = length($i)
            maxword = $i
        }
}

# Print out the maxword and the maxlength
END {
    print maxword, maxlength
}
Textfile:
[jaypal:~/Temp] cat textfile
AWK utility is a data_extraction and reporting tool that uses a data-driven scripting language
consisting of a set of actions to be taken against textual data (either in files or data streams)
for the purpose of producing formatted reports.
The language used by awk extensively uses the string datatype,
associative arrays (that is, arrays indexed by key strings), and regular expressions.
Test:
[jaypal:~/Temp] ./script.awk textfile
data_extraction 15
Relatively speedy bash function using no external utils:
# Usage: longcount < textfile
longcount ()
{
    declare -a c;
    while read x; do
        c[${#x}]="$x";   # index the array by word length
    done;
    # note: ${#c[@]} is the number of distinct lengths seen, which equals
    # the longest length only if every shorter length also occurs
    echo ${#c[@]} "${c[${#c[@]}]}"
}
Example:
longcount < /usr/share/dict/words
Output:
23 electroencephalograph's
A modified POSIX shell version of jimis' xargs-based
answer; still very slow, it takes two or three minutes:
tr "'" '_' < /usr/share/dict/words |
xargs -P$(nproc) -n1 -i sh -c 'set -- {} ; echo ${#1} "$1"' |
sort -n | tail | tr '_' "'"
Note the leading and trailing tr bit to get around GNU xargs
difficulty with single quotes.
This prints each word's length, pairs the lengths with the words via paste, sorts numerically, and keeps the last (longest) line:
for i in $(cat so.txt); do echo ${#i}; done | paste - so.txt | sort -n | tail -1
Slow because of the gazillion of forks, but pure shell, does not require awk or special bash features:
$ cat /usr/share/dict/words | \
xargs -n1 -I '{}' -d '\n' sh -c 'echo `echo -n "{}" | wc -c` "{}"' | \
sort -n | tail
23 Pseudolamellibranchiata
23 pseudolamellibranchiate
23 scientificogeographical
23 thymolsulphonephthalein
23 transubstantiationalist
24 formaldehydesulphoxylate
24 pathologicopsychological
24 scientificophilosophical
24 tetraiodophenolphthalein
24 thyroparathyroidectomize
You can easily parallelize, e.g. to 4 CPUs by providing -P4 to xargs.
EDIT: modified to work with the single quotes that some dictionaries have. Now it requires GNU xargs because of the -d argument.
EDIT2: for the fun of it, here is another version that handles all kinds of special characters, but requires the -0 option to xargs. I also added -P4 to compute on 4 cores. The word "wordcount" just fills sh's $0 slot so the actual word lands safely in $1, avoiding quoting problems:
cat /usr/share/dict/words | tr '\n' '\0' | \
xargs -0 -I {} -n1 -P4 sh -c 'echo ${#1} "$1"' wordcount {} | \
sort -n | tail
$ cat isbndb.sample | wc -l
13
$ var=$(cat isbndb.sample); echo $var | wc -l
1
Why is the newline character missing when I assign the string to the variable? How can I keep the newline character from being converted into a space?
I am using bash.
You have to quote the variable to preserve the newlines.
$ var=$(cat isbndb.sample); echo "$var" | wc -l
And cat is unnecessary in both cases:
$ wc -l < isbndb.sample
$ var=$(< isbndb.sample); echo "$var" | wc -l
Edit:
Command substitution strips trailing newlines when bash assigns a file's contents to a variable, so you have to resort to some tricks to preserve them. Try this:
IFS='' read -d '' var < isbndb.sample; echo "$var" | wc -l
Setting IFS to null prevents the file from being split on the newlines and setting the delimiter for read to null makes it accept the file until the end of file.
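A quick check of the trick, assuming isbndb.sample ends with a single trailing newline; printf '%s' is used instead of echo so no extra newline is added to the count:
$ IFS='' read -d '' var < isbndb.sample
$ printf '%s' "$var" | wc -l
13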
var=($(< file))
echo ${#var[@]}   # number of whitespace-separated words, not lines