Bash: extract (percent) number of variable length from a string

I want to write a little progress bar using a bash script.
To generate the progress bar I have to extract the progress from a log file.
The content of such a file (here run.log) looks like this:
Time to finish 2d 15h, 42.5% completed, time steps left 231856
I'm now interested in isolating the 42.5%. The problem is that the length of this number is variable, as is its position (e.g. 'time to finish' might contain only one number, like 23h or 59min).
I tried it over the position via
echo "$(tail -1 run.log | awk '{print $6}'| sed -e 's/[%]//g')"
which fails when 'Time to finish' is short. I also tried matching on the %-sign:
echo "$(tail -1 run.log | egrep -o '[0-9][0-9].[0-9]%')"
This works only for percentages >= 10%.
Any solution for a more flexible number extraction?
======================================================
Update: Here is now the full script for the progress bar:
#!/bin/bash
# extract % complete from run.log
perc="$(tail -1 run.log | grep -o '[^ ]*%')"
# convert perc to int
pint="${perc/.*}"
# number of # to plot
nums="$(echo "$pint /2" | bc)"
# output
echo -e ""
echo -e " completed: $perc"
echo -ne " "
for i in $(seq $nums); do echo -n '#'; done
echo -e ""
echo -e " |----.----|----.----|----.----|----.----|----.----|"
echo -e " 0% 20% 40% 60% 80% 100%"
echo -e ""
tail -1 run.log
echo -e ""
Thanks for your help, guys!

Based on your example,
grep -o '[^ ]*%'
should give what you want.
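For example, fed the sample line from the question, it prints the value with its % sign:
echo 'Time to finish 2d 15h, 42.5% completed, time steps left 231856' | grep -o '[^ ]*%'
42.5%
If you need the bare number, strip the sign afterwards, e.g. with ${perc%\%} or a further sed 's/%//'.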

You can extract % from below command:
tail -n 1 run.log | grep -o -P '[0-9]*(\.[0-9]*)?(?=%)'
Explanation:
grep options:
-o : Print only matching string.
-P : Use perl style regex
regex parts:
[0-9]* : Match any digit, repeated any number of times.
(\.[0-9]*)? : Match decimal point, followed by any number of digits.
? at the end makes it optional. (This takes care of numbers without a fractional part.)
(?=%) : The regex before this must be followed by a % sign (search for "positive lookahead" for more details).
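A quick check with a shortened log line (my own example, to show it also copes with a single digit and no fractional part):
echo 'Time to finish 15h, 7% completed, time steps left 231856' | grep -o -P '[0-9]*(\.[0-9]*)?(?=%)'
7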

You should be able to isolate the progress after the first comma (,) in your file, i.e. you want the characters between , and %.

There are many ways to achieve your goal. I would prefer using cut several times as it is easy to read.
cut -f1 -d'%' | cut -f2 -d',' | cut -f2 -d' '
After first cut:
Time to finish 2d 15h, 42.5
After the second (note the leading space):
 42.5
And the last one just to get rid of space, the final result:
42.5
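Combined with the tail from the question, the whole extraction reads:
tail -1 run.log | cut -f1 -d'%' | cut -f2 -d',' | cut -f2 -d' '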

How to find 10 most frequent words in the file in Unix/Linux

How to find 10 most frequent words in the file in Unix/Linux?
I tried using this command in Unix:
sort file.txt | uniq -c | sort -nr | head -10
However I am not sure if it's correct and whether it is showing me 10 most frequent words in the large file.
Here is a shell script that deals with your problem, even if the file has more than one word per line.
wordcount.sh
#!/bin/bash
# filename: wordcount.sh
# usage: word count
# handle positional arguments
if [ $# -ne 1 ]
then
    echo "Usage: $0 filename"
    exit 1
fi
# perform the word count
printf "%-14s%s\n" "Word" "Count"
cat "$1" | tr 'A-Z' 'a-z' | \
egrep -o "\b[[:alpha:]]+\b" | \
awk '{ count[$0]++ }
END {
    for (ind in count)
        { printf("%-14s%d\n", ind, count[ind]) }
}' | sort -k2 -n -r | head -n 10
just run ./wordcount.sh filename.txt
Explanation:
The tr command converts all uppercase letters to lowercase, then egrep grabs all the words in the text and prints them one per line. Finally, awk uses an associative array to implement the word count, and the output is sorted in decreasing order of occurrence count.
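A quick demonstration with a made-up demo.txt (the order of words with equal counts may vary):
printf 'the cat and the dog\nThe END\n' > demo.txt
./wordcount.sh demo.txt
Word          Count
the           3
and           1
cat           1
dog           1
end           1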

Piping grep to cut

This line:
echo $(grep Uid /proc/1/status) | cut -d ' ' -f 2
Produces output:
0
This line:
grep Uid /proc/1/status | cut -d ' ' -f 2
Produces output:
Uid: 0 0 0 0
My goal was the first output. My question is: why does the second command not produce the output I expected? Why am I required to echo it?
One way to do this is to change the IFS (Internal Field Separator) variable in the bash shell:
IFSOLD="$IFS"    # IFS stands for Internal Field Separator
IFS=$'\t'
grep 'Uid' /proc/1/status | cut -f 2
0 # Your result
IFS="$IFSOLD"
or the easy way
grep 'Uid' /proc/1/status | cut -d $'\t' -f 2
Note: by the way, tab is the default delimiter for cut.
Use awk
awk '/Uid/ { print $2; }' /proc/1/status
You should almost never need to write something like echo $(...) - it's almost equivalent to calling ... directly. Try echo "$(...)" (which you should always use) instead, and you'll see it behaves like ....
The reason is because when the $() command substitution is invoked without quotes the resulting string is split by Bash into separate arguments before being passed to echo, and echo outputs each argument separated by a single space, regardless of the whitespace generated by the command substitution (in your case tabs).
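A minimal demonstration of that splitting, using a tab-separated string instead of /proc:
s="$(printf 'Uid:\t0\t0')"
echo $s      # unquoted: word-split first, so the tabs come out as single spaces: Uid: 0 0
echo "$s"    # quoted: the tabs survive intact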
As sjsam suggested, if you want to cut tab-delimited output, just specify tabs as the delimiter instead of spaces:
cut -d $'\t' -f 2
grep Uid /proc/1/status | sed -r 's/\s+/ /g' | awk '{print $3}'
Output
0

Linux: Extract string from a line including delimiter character using sed command [duplicate]

For example
echo "abc-1234a :" | grep <do-something>
to print only abc-1234a
I think these are closer to what you're getting at, but without knowing what you're really trying to achieve, it's hard to say.
echo "abc-1234a :" | egrep -o '^[^:]+'
... though this will also match lines that have no colon. If you only want lines with colons, and you must use only grep, this might work:
echo "abc-1234a :" | grep : | egrep -o '^[^:]+'
Of course, this only makes sense if your echo "abc-1234a :" is an example that would be replaced with possibly multiple lines of input.
The smallest tool you could use is probably cut:
echo "abc-1234a :" | cut -d: -f1
And sed is always available...
echo "abc-1234a :" | sed 's/ *:.*//'
For this last one, if you only want to print lines that include a colon, change it to:
echo "abc-1234a :" | sed -ne 's/ *:.*//p'
Heck, you could even do this in pure bash:
while read line; do
    field="${line%%:*}"
    # do stuff with $field
done <<<"abc-1234a :"
For information on the %% bit, you can man bash and search for "Parameter Expansion".
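As a minimal illustration of that expansion (note the result keeps the trailing space from the input, which the sed variants above strip):
line="abc-1234a :"
echo "${line%%:*}"    # prints "abc-1234a " - the longest suffix matching ':*' is removed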
UPDATE:
You said:
It's the characters in the first line of input before the colon. The input could have multiple lines though.
The solutions with grep probably aren't your best choice, then, since they'll also print data from subsequent lines that might include colons. Of course, there are many ways to solve this requirement as well. We'll start with sample input:
$ function sample { printf "abc-1234a:foo\nbar baz:\nNarf\n"; }
$ sample
abc-1234a:foo
bar baz:
Narf
You could use multiple pipes, for example:
$ sample | head -1 | grep -Eo '^[^:]*'
abc-1234a
$ sample | head -1 | cut -d: -f1
abc-1234a
Or you could use sed to process only the first line:
$ sample | sed -ne '1s/:.*//p'
abc-1234a
Or tell sed to exit after printing the first line (which is faster than reading the whole file):
$ sample | sed 's/:.*//;q'
abc-1234a
Or do the same thing but only show output if a colon was found (for safety):
$ sample | sed -ne 's/:.*//p;q'
abc-1234a
Or have awk do the same thing (as the last 3 examples, respectively):
$ sample | awk '{sub(/:.*/,"")} NR==1'
abc-1234a
$ sample | awk 'NR>1{nextfile} {sub(/:.*/,"")} 1'
abc-1234a
$ sample | awk 'NR>1{nextfile} sub(/:.*/,"")'
abc-1234a
Or in bash, with no pipes at all:
$ read line < <(sample)
$ printf '%s\n' "${line%%:*}"
abc-1234a
It is possible to do what you want with only sed.
Here is an example:
#!/bin/sh
filename=$1
pattern=yourpattern
# the -n flag disables printing every line (the default behavior)
sed -n "
1,/$pattern/ {
    # skip the line containing the pattern
    /$pattern/d
    # print lines ranging from line 1 until the pattern
    p
}
" "$filename"
exit 0
This works at least for GNU's sed. It should work for other sed too, except
regarding the comments (some implementations of sed don't support comments).
Source: https://www.grymoire.com/Unix/Sed.html
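As a sketch of a run (assuming pattern=STOP in the script and that you saved it as script.sh):
printf 'one\ntwo\nSTOP\nthree\n' > demo.txt
./script.sh demo.txt
one
two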

Finding the longest word in a text file

I am trying to make a simple script that finds the longest word and its length in a text file using bash. I know when I use awk it's simple and straightforward, but I want to try this method instead... Let's say I have a=wmememememe; if I want to find its length I can use echo ${#a}, and for the word itself, echo ${a}. But I want to apply that to the loop below:
for i in $(cat so.txt); do
Where so.txt contains words. I hope it makes sense.
A bash one-liner:
sed 's/ /\n/g' YOUR_FILENAME | sort | uniq | awk '{print length, $0}' | sort -nr | head -n 1
read file and split the words (via sed)
remove duplicates (via sort | uniq)
prefix each word with its length (awk)
sort the list by the word length
print the single word with greatest length.
Yes, this will be slower than some of the other solutions, but it also doesn't require remembering the semantics of bash for loops.
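For example, against a throwaway so.txt:
printf 'apple banana fig\nkiwi dragonfruit\n' > so.txt
sed 's/ /\n/g' so.txt | sort | uniq | awk '{print length, $0}' | sort -nr | head -n 1
11 dragonfruit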
Normally, you'd want to use a while read loop instead of for i in $(cat), but since you want all the words to be split, in this case it would work out OK.
#!/bin/bash
longest=0
for word in $(<so.txt)
do
    len=${#word}
    if (( len > longest ))
    then
        longest=$len
        longword=$word
    fi
done
printf 'The longest word is %s and its length is %d.\n' "$longword" "$longest"
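Against the same kind of throwaway so.txt, and assuming you saved the script as longest.sh:
printf 'apple banana fig\nkiwi dragonfruit\n' > so.txt
./longest.sh
The longest word is dragonfruit and its length is 11.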
Another solution:
for item in $(cat "$infile"); do
    length[${#item}]=$item    # use word length as index
done
maxword=${length[@]: -1}      # select last array element
printf "longest word '%s', length %d\n" "${maxword}" "${#maxword}"
longest=""
for word in $(cat so.txt); do
if [ ${#word} -gt ${#longest} ]; then
longest=$word
fi
done
echo $longest
awk script:
#!/usr/bin/awk -f
# Initialize two variables
BEGIN {
    maxlength=0;
    maxword=0
}
# Loop through each word on the line
{
    for (i=1; i<=NF; i++)
        # Assign the maxlength variable if the length of the word found is
        # greater. Also, assign the word to the maxword variable.
        if (length($i) > maxlength)
        {
            maxlength=length($i);
            maxword=$i;
        }
}
# Print out the maxword and the maxlength
END {
    print maxword, maxlength;
}
Textfile:
[jaypal:~/Temp] cat textfile
AWK utility is a data_extraction and reporting tool that uses a data-driven scripting language
consisting of a set of actions to be taken against textual data (either in files or data streams)
for the purpose of producing formatted reports.
The language used by awk extensively uses the string datatype,
associative arrays (that is, arrays indexed by key strings), and regular expressions.
Test:
[jaypal:~/Temp] ./script.awk textfile
data_extraction 15
Relatively speedy bash function using no external utils:
# Usage: longcount < textfile
longcount ()
{
    declare -a c;
    while read x; do
        c[${#x}]="$x";
    done;
    echo ${#c[@]} "${c[${#c[@]}]}"
}
Example:
longcount < /usr/share/dict/words
Output:
23 electroencephalograph's
Modified POSIX shell version of jimis' xargs-based
answer; still very slow, takes two or three minutes:
tr "'" '_' < /usr/share/dict/words |
xargs -P$(nproc) -n1 -i sh -c 'set -- {} ; echo ${#1} "$1"' |
sort -n | tail | tr '_' "'"
Note the leading and trailing tr bit to get around GNU xargs
difficulty with single quotes.
for i in $(cat so.txt); do echo ${#i}; done | paste - so.txt | sort -n | tail -1
Slow because of the gazillion forks, but pure shell; it does not require awk or special bash features:
$ cat /usr/share/dict/words | \
xargs -n1 -I '{}' -d '\n' sh -c 'echo `echo -n "{}" | wc -c` "{}"' | \
sort -n | tail
23 Pseudolamellibranchiata
23 pseudolamellibranchiate
23 scientificogeographical
23 thymolsulphonephthalein
23 transubstantiationalist
24 formaldehydesulphoxylate
24 pathologicopsychological
24 scientificophilosophical
24 tetraiodophenolphthalein
24 thyroparathyroidectomize
You can easily parallelize, e.g. to 4 CPUs by providing -P4 to xargs.
EDIT: modified to work with the single quotes that some dictionaries have. Now it requires GNU xargs because of -d argument.
EDIT2: for the fun of it, here is another version that handles all kinds of special characters, but requires the -0 option to xargs. I also added -P4 to compute on 4 cores:
cat /usr/share/dict/words | tr '\n' '\0' | \
xargs -0 -I {} -n1 -P4 sh -c 'echo ${#1} "$1"' wordcount {} | \
sort -n | tail

How do I find the count of multiple words in a text file?

I am able to find the number of times a word occurs in a text file, like in Linux we can use:
cat filename|grep -c tom
My question is, how do I find the count of multiple words like "tom" and "joe" in a text file.
Since you have a couple of names, regular expressions are the way to go on this one. At first I thought it was as simple as a grep count on the regular expression of joe or tom, but found that this did not account for the scenario where tom and joe are on the same line (or tom and tom, for that matter).
test.txt:
tom is really really cool! joe for the win!
tom is actually lame.
$ grep -c '\<\(tom\|joe\)\>' test.txt
2
As you can see from the test.txt file, 2 is the wrong answer, so we needed to account for names being on the same line.
I then used grep -o to show only the part of a matching line that matches the pattern, which gave the correct pattern matches of tom or joe in the file. I then piped the results into wc -l for the line count.
$ grep -o '\(joe\|tom\)' test.txt|wc -l
3
3...the correct answer! Hope this helps
Ok, so first split the file into words, then sort and uniq:
tr -cs '[:alnum:]' '\n' < testdata | sort | uniq -c
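With the test.txt from the first answer, you can then filter for just the names you care about:
tr -cs '[:alnum:]' '\n' < test.txt | sort | uniq -c | grep -wE 'tom|joe'
      1 joe
      2 tom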
You use uniq:
sort filename | uniq -c
Use awk:
{
    for (i=1; i<=NF; i++)
        count[$i]++
}
END {
    for (i in count)
        print count[i], i
}
This will produce a complete word frequency count for the input.
Pipe the output to grep to get the desired fields:
awk -f w.awk input | grep -E 'tom|joe'
BTW, you do not need cat in your example; most programs that act as filters can take the filename as a parameter, so it's better to use
grep -c tom filename
If not, there is a strong possibility that people will start throwing the Useless Use of Cat Award at you ;-)
The sample you gave does not search for the word "tom". It will count "atom" and "bottom" and many more.
Grep searches for regular expressions. Regular expression that matches word "tom" or "joe" is
\<\(tom\|joe\)\>
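A quick demonstration of the word boundaries (grep -o prints only the matched parts):
echo "atom bottom tom" | grep -o '\<\(tom\|joe\)\>'
tom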
You could use a regexp:
cat filename |tr ' ' '\n' |grep -c -e "\(joe\|tom\)"
Here is one:
cat txt | tr -s '[:punct:][:space:][:blank:]'| tr '[:punct:][:space:][:blank:]' '\n\n\n' | tr -s '\n' | sort | uniq -c
UPDATE
A shell script solution:
#!/bin/bash
file_name="$2"
string="$1"
if [ $# -ne 2 ]
then
    echo "Usage: $0 <pattern to search> <file_name>"
    exit 1
fi
if [ ! -f "$file_name" ]
then
    echo "file \"$file_name\" does not exist, or is not a regular file"
    exit 2
fi
line_no_list=("")
curr_line_indx=1
line_no_indx=0
total_occurance=0
# line_no_list holds, at slot k, the line number and, at slot k+1, the
# number of times the string occurs on that line
while read line
do
    flag=0
    while [[ "$line" == *$string* ]]
    do
        flag=1
        line_no_list[line_no_indx]=$curr_line_indx
        line_no_list[line_no_indx+1]=$((line_no_list[line_no_indx+1]+1))
        total_occurance=$((total_occurance+1))
        # replace the first occurrence of "$string" with nothing and recheck
        line=${line/"$string"/}
    done
    # if we entered the inner loop, advance the index to the next
    # array slot for the next line
    if (( flag == 1 ))
    then
        line_no_indx=$((line_no_indx+2))
    fi
    curr_line_indx=$((curr_line_indx+1))
done < "$file_name"
echo -e "\nThe string \"$string\" occurs \"$total_occurance\" times"
echo -e "The string \"$string\" occurs in \"$((line_no_indx/2))\" lines"
echo "[Occurrence # : Line Number : No. of occurrences on this line]: "
for ((i=0; i<line_no_indx; i=i+2))
do
    echo "$((i/2+1)) : ${line_no_list[i]} : ${line_no_list[i+1]} "
done
echo
I completely forgot about grep -f:
cat filename | grep -cf names
AWK solution:
Assuming the names are in a file called names:
cat filename | awk 'NR==FNR {h[NR] = $1;ct[i] = 0; cnt=NR} NR !=FNR {for(i=1;i<=cnt;++i) if(match($0,h[i])!=0) ++ct[i] } END {for(i in h) print h[i], ct[i]}' names -
Note that your original grep doesn't search for words. e.g.
$ echo tomorrow | grep -c tom
1
You need grep -w
gawk -vRS='[^[:alpha:]]+' '{print}' | grep -Ec '^(tom|joe|bob|sue)$'
The gawk program sets the record separator to anything non-alphabetic, so every word will end up on a separate line. Then grep counts lines that match one of the words you want exactly.
We use gawk because the POSIX awk doesn't allow regex record separator.
For brevity, you can replace '{print}' with 1 - either way, it's an Awk program that simply prints out all input records ("is 1 true? it is? then do the default action, which is {print}.")
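For example, with a made-up line (note that neither tomorrow nor the capitalized Tom is counted):
printf 'tom and joe. tomorrow, Tom!\n' | gawk -vRS='[^[:alpha:]]+' '1' | grep -Ec '^(tom|joe)$'
2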
To find all hits in all lines
echo "tom is really really cool! joe for the win!
tom is actually lame." | akw '{i+=gsub(/tom|joe/,"")} END {print i}'
3
This will count "tomtom" as 2 hits.
