Find words containing 20 vowels grep - linux

I found many similar questions but most of them ask for vowels in a row which is easy. I want to find words that contain 20 vowels not in a row using grep.
I originally thought grep -Ei [aeiou]{20} would do it but that seems to search only for 20 vowels in a row

Use a regular expression that searches for 20 vowels separated by any quantity of consonants.
grep -Ei "[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*\
[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*[aeiou][b-df-hj-np-tv-z]*"
The backslash is just informing the shell that the expression continues on the next line. It is not part of the regex itself.
If you understand that part, you can shorten it considerably using groups. This regexp is the same as above, but using groups in parenthesis with repetition.
grep -Ei "([aeiou][b-df-hj-np-tv-z]*){20}"

I don't believe that's a problem that calls for just a regex. Here's a programmatic approach. We redefine the field separator to the empty string; each character is a field. We iterate over the line; if a character is a vowel we increment a counter. If, at the end of the string, the count is 20, we print it:
cat nicks.awk
BEGIN{
FS=""
}
{
c=0;
for( i=1;i<=NF;i=i+1 ){
if ($i ~ /[aeiou]/ ){
c=c+1;
}
};
if(c==20){
print $0
}
}
And this is what it does ... it only prints back the one string that has 20 vowels.
echo "contributorNickSequestionsfoundcontainingvowelsgrcep" | awk -f nicks.awk
echo "contributorNickSeoquestionsfoundcontainingvowelsgrcep" | awk -f nicks.awk
contributorNickSeoquestionsfoundcontainingvowelsgrcep
echo "contributorNickSaeoquestionsfoundcontainingvowelsgrcep" | awk -f nicks.awk

If all you really need is to find 20 vowels in a line then that's just:
awk '{x=tolower($0)} gsub(/[aeiou]/,"&",x)==20' file
or with grep:
grep -Ei '^[^aeiou]*([aeiou][^aeiou]*){20}$' file
To find words (assuming each is space separated) there's many options including this with GNU awk:
awk -v RS='\\s+' -v IGNORECASE=1 'gsub(/[aeiou]/,"&")==20' file
or this with any awk:
awk '{for (i=1;i<=NF;i++) {x=tolower($i); if (gsub(/[aeiou]/,"&",x)==20) print $i} }' file

Related

Select subdomains using print command

cat a.txt
a.b.c.d.e.google.com
x.y.z.google.com
rev a.txt | awk -F. '{print $2,$3}' | rev
This is showing:
e google
x google
But I want this output
a.b.c.d.e.google
b.c.d.e.google
c.d.e.google
e.google
x.y.z.google
y.z.google
z.google
With your shown samples, please try following awk code. Written and tested in GNU awk should work in any awk.
awk '
BEGIN{
FS=OFS="."
}
{
nf=NF
for(i=1;i<(nf-1);i++){
print
$1=""
sub(/^[[:space:]]*\./,"")
}
}
' Input_file
Here is one more awk solution:
awk -F. '{while (!/^[^.]+\.[^.]+$/) {print; sub(/^[^.]+\./, "")}}' file
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
Using sed
$ sed -En 'p;:a;s/[^.]+\.(.*([^.]+\.){2}[[:alpha:]]+$)/\1/p;ta' input_file
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
Using bash:
IFS=.
while read -ra a; do
for ((i=${#a[#]}; i>2; i--)); do
echo "${a[*]: -i}"
done
done < a.txt
Gives:
a.b.c.d.e.google.com
b.c.d.e.google.com
c.d.e.google.com
d.e.google.com
e.google.com
x.y.z.google.com
y.z.google.com
z.google.com
(I assume the lack of d.e.google.com in your expected output is typo?)
For a shorter and arguably simpler solution, you could use Perl.
To auto-split the line on the dot character into the #F array, and then print the range you want:
perl -F'\.' -le 'print join(".", #F[0..$#F-1])' a.txt
-F'\.' will auto-split each input line into the #F array. It will split on the given regular expression, so the dot needs to be escaped to be taken literally.
$#F is the number of elements in the array. So #F[0..$#F-1] is the range of elements from the first one ($F[0]) to the penultimate one. If you wanted to leave out both "google" and "com", you would use #F[0..$#F-2] etc.

Replaceing multiple command calls

Able to trim and transpose the below data with sed, but it takes considerable time. Hope it would be better with AWK. Welcome any suggestions on this
Input Sample Data:
[INX_8_60L ] :9:Y
[INX_8_60L ] :9:N
[INX_8_60L ] :9:Y
[INX_8_60Z ] :9:Y
[INX_8_60Z ] :9:Y
Required Output:
INX?_8_60L¦INX?_8_60L¦INX?_8_60L¦INX?_8_60Z¦INX?_8_60Z
Just use awk, e.g.
awk -v n=0 '{printf (n?"!%s":"%s", substr ($0,2,match($0,/[ \t]+/)-2)); n=1} END {print ""}' file
Which will be orders of magnitude faster. It just picks out the (e.g. "INX_8_60L") substring using substring and match. n is simply used as a false/true (0/1) flag to prevent outputting a "!" before the first string.
Example Use/Output
With your data in file you would get:
$ awk -v n=0 '{printf (n?"!%s":"%s", substr ($0,2,match($0,/[ \t]+/)-2)); n=1} END {print ""}' file
INX_8_60L!INX_8_60L!INX_8_60L!INX_8_60Z!INX_8_60Z
Which appears to be what you are after. (Note: I'm not sure what your separator character is, so just change above as needed) If not, let me know and I'm happy to help further.
Edit Per-Changes
Including the '?' isn't difficult, and I just copied the character, so you would now have:
awk -v n=0 '{s=substr($0,2,match($0,/[ \t]+/)-2); sub(/_/,"?_",s); printf n?"¦%s":"%s", s; n=1}
END {print ""}' file
Example Output
INX?_8_60L¦INX?_8_60L¦INX?_8_60L¦INX?_8_60Z¦INX?_8_60Z
And to simplify, just operating on the first field as in #JamesBrown's answer, that would reduce to:
awk -v n=0 '{s=substr($1,2); sub(/_/,"?_",s); printf n?"¦%s":"%s", s; n=1} END {print ""}' file
Let me know if that needs more changes.
Don't start so many sed commands, separate the sed operations with semicolon instead.
Try to process the data in a single job and avoid regex. Below reading with substr() static sized first block and insterting ? while outputing.
$ awk '{
b=b (b==""?"":";") substr($1,2,3) "?" substr($1,5)
}
END {
print b
}' file
Output:
INX?_8_60L;INX?_8_60L;INX?_8_60L;INX?_8_60Z;INX?_8_60Z
If the fields are not that static in size:
$ awk '
BEGIN {
FS="[[_ ]" # split field with regex
}
{
printf "%s%s?_%s_%s",(i++?";":""), $2,$3,$4 # output semicolons and fields
}
END {
print ""
}' file
Performance of solutions for 20 M records:
Former:
real 0m8.017s
user 0m7.856s
sys 0m0.160s
Latter:
real 0m24.731s
user 0m24.620s
sys 0m0.112s
sed can be very fast when used gingerly, so for simplicity and speed you might wish to consider:
sed -e 's/ .*//' -e 's/\[INX/INX?/' | tr '\n' '|' | sed -e '$s/|$//'
The second call to sed is there to satisfy the requirement that there is no trailing |.
Another solution using GNU awk:
awk -F'[[ ]+' '
{printf "%s%s",(o?"¦":""),gensub(/INX/,"INX?",1,$2);o=1}
END{print ""}
' file
The field separator is set (with -F option) such that it matches the wanted parameter.
The main statement is to print the modified parameter with the ? character.
The variable o allows to keep track of the delimeter ¦.

Extract multiple floating numbers from a line

I want to extract timeTaken values from following line:
<some other log data> Exception, Curl1-Time: 0.258315s. Curl2-Time: 3.9092588424683s Exiting.
I am using following command with grep and awk:
grep -Po "Exception, Curl1-Time: \K(\d+.\d*)s. Curl2-Time: (\d+.\d+)" app.log | awk '{print $1 + $3}'
This outputs: 4.167565
Can this be done in more smarter way, maybe using sed or any other
bash tool.
Is it ok to ignore trailing "s." in time-taken
values as the result of addition is correct.
You already use PCRE. Why not use Perl itself?
perl -lne 'print $1 + $2
if /Exception, Curl1-Time: ([\d.]+)s\. Curl2-Time: ([\d.]+)/
' < input
If you have GNU's grep, then you can execute:
var="<some other log data> Exception, Curl1-Time: 0.258315s. Curl2-Time: 3.9092588424683s Exiting."
grep -Eo '[[:digit:]]+\.[[:digit:]]+s?' <<< "$var"
Or you can use awk and stay POSIX:
var="<some other log data> Exception, Curl1-Time: 0.258315s. Curl2-Time: 3.9092588424683s Exiting."
awk '{ while (match($0, /[[:digit:]]+\.[[:digit:]]+s?/)) { print substr($0, RSTART, RLENGTH); $0 = substr($0, RSTART + RLENGTH) } }' <<< "$var"
As you can see, both commands use the regex [[:digit:]]+\.[[:digit:]]+s? to match a pattern of one or more digits, a dot, one or more digits and an optional 's'.
GNU's grep uses the -o option to extract the matching regex pattern.
The awk version uses its match and substr functions, to match and extract relevant data.
After a regex match, RSTART and RLENGTH are set and we can use them to calculate a start and end positions for substr.
RLENGTH is the length of the substring matched by the match function.
RSTART is the start-index in characters of the substring matched by the match function.
see section Built-in Functions for String Manipulation
sed 's/.*Curl1-Time: \([0-9]\.[0-9]*\)s.*\([0-9]\.[0-9]*\)s.*$/\1 \2/p' filename | awk '{print ($1+$2);}'
Regex pattern matching ".Curl1-Time: ([0-9].[0-9])s.([0-9].[0-9])s.*$" ---> Pattern within the braces is the number matching regex.
Entire line is replaced with two matching patterns. i.e the output of sed will be two numbers with spaces in between them. e.g. 1234 34567
awk parses the sed output with default space delimiter and sums up them and prints the result.

sed: remove whole words containg a character class

I'd like to remove any word which contains a non alpha char from a text file. e.g
"ok 0bad ba1d bad3 4bad4 5bad5bad5"
should become
"ok"
I've tried using
echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed 's/\b[a-zA-Z]*[^a-zA-Z]\+[a-zA-Z]*\b/ /g'
The following sed command does the job:
sed 's/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g'
It removes all words containing at least one non-alphabetic character. It is better to use POSIX character classes like [:alpha:], because for instance they won't consider the French name "François" as being faulty (i.e. containing a non-alphabetic character).
Explanation
We remove all patterns starting with an arbitrary number of spaces followed by an arbitrary (possibly nil) number of alphabetic characters, followed by at least one non-space and non-alphabetic character, and then glob to the end of the word (i.e. until the next space). Please note that you may want to swap [:space:] for [:blank:], see this page for a detailed explanation of the difference between these two POSIX classes.
Test
$ echo "ok 0bad ba1d bad3 4bad4 5bad5bad5" | sed 's/[[:space:]]*[[:alpha:]]*[^[:space:][:alpha:]][^[:space:]]*//g'
ok
Using awk:
s="ok 0bad ba1d bad3 4bad4 5bad5bad5"
awk '{ofs=""; for (i=1; i<=NF; i++) if ($i ~ /^[[:alpha:]]+$/)
{printf "%s%s", ofs, $i; ofs=OFS} print ""}' <<< "$s"
ok
This awk command loops through all words and if word matches the regex /^[[:alpha:]]+$/ then it writes to standard out. (i<NF)?OFS:RS is a short cut to add OFS if current field no is less than NF otherwise it writes RS.
Using grep + tr together:
s="ok 0bad ba1d bad3 4bad4 5bad5bad5"
r=$(grep -o '[^ ]\+' <<< "$s"|grep '^[[:alpha:]]\+$'|tr '\n' ' ')
echo "$r"
ok
First grep -o breaks the string into individual words. 2nd grep only searches for words with alphabets only. ANd finally tr translates \n to space.
If you're not concerned about losing different numbers of spaces between each word, you could use something like this in Perl:
perl -ane 'print join(" ", grep { !/[^[:alpha:]]/ } #F), "\n"
the -a switch enables auto-split mode, which splits the text on any number of spaces and stores the fields in the array #F. grep filters out the elements of that array that contain any non-alphabetical characters. The resulting array is joined on a single space.
This might work for you (GNU sed):
sed -r 's/\b([[:alpha:]]+\b ?)|\S+\b ?/\1/g;s/ $//' file
This uses a back reference within alternation to save the required string.
st="ok 0bad ba1d bad3 4bad4 5bad5bad5"
for word in $st;
do
if [[ $word =~ ^[a-zA-Z]+$ ]];
then
echo $word;
fi;
done

How do I check for a NAWK substring being several possible values?

I am very, very much a beginner with NAWK (or AWK) but I know that you can check for a substring value using:
nawk '{if (substr($0,42,4)=="ABCD") {print {$0}}}' ${file}
(This is being run through UNIX, hence the '$0'.)
What if the string could be either ABCD or MNOP? Is there an easy way to code this as a one-liner? I've tried looking but so far only found myself lost...
For example with:
nawk 'substr($0,42,4)=="ABCD" || substr($0,42,4)=="MNOP"' ${file}
Note your current command does have some unnecessary parts that awk handles implicitly:
nawk '{if (substr($0,42,4)=="ABCD") {print {$0}}}' ${file}
{print {$0}} is the default awk action, so it can be skipped, as well as the if {} condition. All together, you can let it be like
nawk 'substr($0,42,4)=="ABCD"' ${file}
For more reference you can check Idiomatic awk.
Test
$ cat a
hello this is me
hello that is me
hello those is me
$ awk 'substr($0,7,4)=="this"' a
hello this is me
$ awk 'substr($0,7,4)=="this" || substr($0,7,4)=="that"' a
hello this is me
hello that is me
If you have a large list of possible valid values, you can declare an array, then check to see if that substring is in the array.
nawk '
BEGIN { valid["ABCD"] = 1
valid["MNOP"] = 1
# ....
}
substr($0,42,4) in valid
' file
One thing to remember: the in operator looks at an associative array's keys, not the values.
Assuming your values are not regex metacharacters, you could say:
nawk 'substr($0,42,4)~/ABCD|MNOP/' ${file}
If the values contain metacharacters ([, \, ^, $, ., |, ?, *, +, (, )), then you'd need to escape those with a \.
You said "string" not "RE" so this is the approach to take for a string comparison against multiple values:
awk -v strs='ABCD MNOP' '
BEGIN {
split(strs,tmp)
for (i in tmp)
strings[tmp[i]]
}
substr($0,42,4) in strings
' file

Resources