How can I find the number of 8 letter words that do not contain the letter "e", using the grep command? - linux

I want to find the number of 8 letter words that do not contain the letter "e" in a number of text files (*.txt). In the process I ran into two issues: my lack of understanding in quantifiers and how to exclude characters.
I'm quite new to the Unix terminal, but this is what I have tried:
cat *.txt | grep -Eo "\w+" | grep -i ".*[^e].*"
I need to include the cat command because it otherwise includes the names of the text files in the pipe. The second pipe is to have all the words in a list, and it works, but the last pipe was meant to find all the words that do not have the letter "e" in them, but doesn't seem to work. (I thought "." for no or any number of any character, followed by a character that is not an "e", and followed by another "." for no or any number of any character.)
cat *.txt | grep -Eo "\w+" | grep -wi "[a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]"
This command works to find the words that contain 8 characters, but it is quite ineffective, because I have to repeat "[a-z]" 8 times. I thought it could also be "[a-z]{8}", but that doesn't seem to work.
cat *.txt | grep -Eo "\w+" | grep -wi "[a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]" | grep -i ".*[^e].*"
So finally, this would be my best guess, however, the third pipe is ineffective and the last pipe doesn't work.

You may use this grep:
grep -hEiwo '[a-df-z]{8}' *.txt
Here:
[a-df-z]{8}: Matches exactly 8 letters, none of which is e
-h: Don't print filename in output
-i: Ignore case search
-o: Print matches only
-w: Match complete words
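Since the question asks for the number of such words rather than the words themselves, the matches can be piped to wc -l. The sketch below is only illustrative: the two sample files and their contents are made up for the demonstration, and it assumes a grep that supports -o and -w (GNU and BSD grep do).

```shell
# Hypothetical sample data in a throwaway directory (not from the question).
tmpdir=$(mktemp -d)
printf 'The mountain footpath was slippery.\n' > "$tmpdir/a.txt"
printf 'Sunlight and sunshine; ABSTRACT art.\n' > "$tmpdir/b.txt"

# -o prints each match on its own line, so counting lines counts words.
# tr strips the padding that some wc implementations add.
count=$(grep -hEiwo '[a-df-z]{8}' "$tmpdir"/*.txt | wc -l | tr -d ' ')
echo "$count"   # counts mountain, footpath, Sunlight and ABSTRACT

rm -rf "$tmpdir"
```

Note that grep -c would not help here, because it counts matching lines, not individual matches.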

If you are OK with GNU awk, and assuming that you want to print only the exact words (there may be multiple matches in a line), you could try the following.
awk -v IGNORECASE="1" '{for(i=1;i<=NF;i++){if($i~/^[a-df-z]{8}$/){print $i}}}' *.txt
OR without the use of IGNORECASE one could try:
awk '{for(i=1;i<=NF;i++){if(tolower($i)~/^[a-df-z]{8}$/){print $i}}}' *.txt
NOTE: This assumes that you want only whitespace-separated fields consisting of exactly 8 letters. An 8-letter word followed by a punctuation mark will be excluded.
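If words with attached punctuation should be counted as well, one possible workaround is to strip non-letters from each field before testing it. This is only a sketch under assumptions: the input line is invented, and stripping is a blunt tool (it would also turn "don't" into "dont"). It uses length() instead of the {8} interval so that it also runs on awk versions without interval support.

```shell
# Hypothetical input line; the trailing comma and full stop are the point.
line='A mountain, a footpath and sunlight.'

count=$(printf '%s\n' "$line" | awk '{
  for (i = 1; i <= NF; i++) {
    w = tolower($i)
    gsub(/[^a-z]/, "", w)              # drop anything that is not a letter
    if (length(w) == 8 && w !~ /e/) n++
  }
} END { print n + 0 }')
echo "$count"
```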

Here is a crazy thought with GNU awk:
awk 'BEGIN{FPAT="\\<\\w{8}\\>"}{c+=NF}END{print c}' file
Or if you want to make it work only on a select set of characters:
awk 'BEGIN{FPAT="\\<[a-df-z]{8}\\>"}{c+=NF}END{print c}' file
What this does is define the fields to be runs of 8 characters (\w as a word constituent, or [a-df-z] as a selected set) enclosed by word boundaries (\< and \>). This is done with FPAT (note the Gory details about escaping).
Sometimes you might also have words which contain diacritics, so the character set has to be expanded. In that case it may be better to match any 8 word characters and reject the ones that contain an e:
awk 'BEGIN{FPAT="\\<\\w{8}\\>"}{for(i=1;i<=NF;++i) if($i !~ /e/) c++}END{print c}' file
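Where GNU awk is not available, a roughly equivalent count can be sketched with tr and grep, which is close to the pipeline attempted in the question. The sample text is made up. Two details differ from the original attempt: the interval {8} needs -E (or \{8\} in basic syntax), and excluding a letter is done with -v rather than with [^e], which only requires one character that is not an e.

```shell
# Hypothetical sample text (not from the question).
text='The Mountain footpath; sunlight, sunshine and slippery ABSTRACT art.'

count=$(printf '%s\n' "$text" \
  | tr -cs '[:alpha:]' '\n' \
  | grep -Ex '[[:alpha:]]{8}' \
  | grep -vic 'e')
# tr puts every run of letters on its own line, grep -x keeps lines that are
# exactly 8 letters long, and grep -vic counts those without an e or E.
echo "$count"
```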

Related

Linux Bash: extracting text from file into variable

I haven't found anything that clearly answers my question. Although very close, I think...
I have a file with a line:
# Skipsdata for serienummer 1158
I want to extract the 4 digit number at the end and put it into a variable, this number changes from file to file so I can't just search for "1158". But the "# Skipsdata for serienummer" always remains the same.
I believe that either grep, sed or awk may be the answer but I'm not 100 % clear on their usage.
Using Awk as
numberRequired=$(awk '/# Skipsdata for serienummer/{print $NF}' file)
printf "%s\n" "$numberRequired"
1158
You can use grep with the -o switch, which prints only the matched part instead of the whole line.
Print all numbers at the end of lines from file yourFile
grep -Po '\d+$' yourFile
Print all four digit numbers at the end of lines like described in your question:
grep -Po '^# Skipsdata for serienummer \K\d{4}$' yourFile
-P enables perl style regexes which support \d and especially \K.
\d matches any digit (0-9).
\d{4} matches exactly four digits.
\K lets grep forget the previously matched part, such that only the part afterwards is printed.
There are multiple ways to find your number. Assuming the input data is in a file called inputfile:
mynumber=$(sed -n 's/# Skipsdata for serienummer //p' <inputfile) will print only the number and ignore all the other lines;
mynumber=$(grep '^# Skipsdata for serienummer' inputfile | cut -d ' ' -f 5) will filter the relevant lines first, then only output the 5th field (the number)
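Whichever tool is used, it can be worth checking the captured value before relying on it. The following is a minimal sketch, assuming a made-up input file and a variable name (serial) that does not appear in the answers above; it anchors the sed pattern to exactly four digits at the end of the line.

```shell
# Hypothetical input file mimicking the one in the question.
tmpfile=$(mktemp)
printf '%s\n' 'first line' '# Skipsdata for serienummer 1158' 'last line' > "$tmpfile"

# Capture the four digits and print only lines where the substitution happened.
serial=$(sed -n 's/^# Skipsdata for serienummer \([0-9]\{4\}\)$/\1/p' "$tmpfile")
rm -f "$tmpfile"

# Guard against a missing or malformed line before using the value.
case $serial in
  [0-9][0-9][0-9][0-9]) echo "serial=$serial" ;;
  *) echo "no serial number found" >&2 ;;
esac
```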

Extracting key word from a log line

I have a log with lines like this:
.....client connection.....remote=/xxx.xxx.xxx.xxx]].......
I need to extract all lines in the log which contain the above, and print just the IP after remote=. This would be something in the pattern:
grep "client connection" xxx.log | sed -e ....
Using grep:
grep -oP '(?<=remote=/)[^\]]+' file
-o extracts only the matched part instead of the entire line.
-P enables Perl-like regexes. In this case we are using a positive lookbehind: it matches a run of characters other than "]" that is preceded by remote=/
grep -oP 'client connection.*remote=/\K.*?(?=])' input
Prints anything between remote=/ and closest ] on the lines which contain client connection.
Or by using sed back-references: here the line is divided into three groups, each enclosed by ( and ), which can later be referred to as \1, \2 and \3. The IP address is the 2nd group, so the whole line is replaced by \2. sed has no lazy quantifier such as .*?, so the group uses [^]]* (anything up to the first ]) instead; -n together with the p flag prints only the lines where the substitution happened.
sed -rn '/client connection/ s_(^.*remote=/)([^]]*)]](.*)_\2_p' input
Or using awk :
awk -F'/|]]' '/client connection/{print $2}' input
Try this:
grep 'client connection' test.txt | awk -F'[/\\]]' '{print $2}'
Test case
test.txt
---------
abcd
.....client connection.....remote=/10.20.30.40]].......
abcs
.....client connection.....remote=/11.20.30.40]].......
.....client connection.....remote=/12.20.30.40]].......
Result
10.20.30.40
11.20.30.40
12.20.30.40
Explanation
grep will shortlist the results to only lines matching client connection. awk uses -F flag for delimiter to split text. We ask awk to use / and ] delimiters to split text. In order to use more than one delimiter, we place the delimiters in [ and ]. For example, to split text by = and :, we'd do [=:].
However, in our case, one of the delimiters is ] itself, since my intent is to extract the IP from /x.x.x.x] by splitting the text on / and ]. So we escape it, writing \\] inside the bracket expression. The IP is the 2nd item from the splitting.
A more robust way, improving on this answer, would be to use GNU grep in PCRE mode (-P, for Perl-style regex matching), but matching both patterns as suggested in the question.
grep -oP "client connection.*remote=/\K(\d{1,3}\.){3}\d{1,3}" file
10.20.30.40
11.20.30.40
12.20.30.40
Here, client connection.*remote matches both patterns in the line before the IP is extracted. The \K is PCRE syntax that discards everything matched up to that point, so only the part that follows it is printed.
(\d{1,3}\.){3}\d{1,3}
This matches the IP address: three groups of 1 to 3 digits, each followed by a dot, and then a 4th octet of 1 to 3 digits.
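On a grep without -P support, a similar result can be sketched with extended regular expressions only. This is an alternative pipeline, not one of the answers above: it assumes grep -o is available and uses the test data from the test case shown earlier. Like the PCRE version, it does not check that each octet is at most 255.

```shell
# Test data copied from the test case above.
tmpfile=$(mktemp)
printf '%s\n' \
  'abcd' \
  '.....client connection.....remote=/10.20.30.40]].......' \
  'abcs' \
  '.....client connection.....remote=/11.20.30.40]].......' \
  '.....client connection.....remote=/12.20.30.40]].......' > "$tmpfile"

# Keep lines mentioning a client connection, cut out "remote=/<ip>",
# then keep only what follows the slash.
ips=$(grep 'client connection' "$tmpfile" \
  | grep -oE 'remote=/([0-9]{1,3}\.){3}[0-9]{1,3}' \
  | cut -d/ -f2)
printf '%s\n' "$ips"
rm -f "$tmpfile"
```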

Using grep to get 12 letter alphabet only lines

Using grep
How many 12 letter - alphabet only lines are in testing.txt?
excerpt of testing.txt
tyler1
Tanktop_Paedo
xyz2#geocities.com
milt#uole.com
justincrump
cranges10
namer#uole.com
soulfunkbrotha
timetolearnz
hotbooby#geocities.com
Fire_Crazy
helloworldad
dingbat#geocities.com
from this excerpt, I want to get a result of 2. (helloworldad, and timetolearnz)
I want to check every line and grep only those that have 12 characters in each line. I can't think of a way to do this with grep though.
For the alphabet only, I think I can use
grep [A-Za-z] testing.txt
However, how do I make it so only the characters [A-Za-z] show up in those 12 characters?
You can do it with extended regex -E and by specifying that the match is exactly {12} characters from start ^ to finish $
$ grep -E "^[A-Za-z]{12}$" testing.txt
timetolearnz
helloworldad
Or if you want to get the count -c of the lines you can use
$ grep -cE "^[A-Za-z]{12}$" testing.txt
2
grep supports whole-line match and counting, e.g.:
grep -xc '[[:alpha:]]\{12\}' testing.txt
Output:
2
The [:alpha:] character class matches any alphabetic character; in the C locale it is equivalent to [A-Za-z]. See section 3.2 of the info pages: info grep 'Regular Expressions' 'Character Classes and Bracket Expressions' for more on this subject. Or look it up in the pdf manual online.
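Because [:alpha:] depends on the locale, pinning the locale makes the count predictable. The sketch below writes the excerpt from the question to a temporary file (the file name is arbitrary) and runs the whole-line count with LC_ALL=C, so that only the ASCII letters qualify.

```shell
# The excerpt from the question, written to a throwaway file.
tmpfile=$(mktemp)
printf '%s\n' tyler1 Tanktop_Paedo 'xyz2#geocities.com' 'milt#uole.com' \
  justincrump cranges10 'namer#uole.com' soulfunkbrotha timetolearnz \
  'hotbooby#geocities.com' Fire_Crazy helloworldad 'dingbat#geocities.com' > "$tmpfile"

# LC_ALL=C restricts [[:alpha:]] to the ASCII letters A-Z and a-z.
count=$(LC_ALL=C grep -cx '[[:alpha:]]\{12\}' "$tmpfile")
echo "$count"
rm -f "$tmpfile"
```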

how to grep range of numbers

in a text file I have the following entries:
10.1.0.10-15
10.1.0.20-25
10.1.0.30-35
10.1.0.40-45
I would like to print 10.1.0.10,15, 20, 25,30
cat file | grep 10.1.0.[1,2,3][0.5] -- prints 10,15,20,25,30, 35.
How do I suppress 35?
I do not want to use grep -v .35 ...just want to print specific IPs or #s.
You can use:
grep -E '10\.1\.0\.([12][05]|30)' file
However awk will be more readable:
awk -F '[.-]' '$4%5 == 0 && $4 >= 10 && $4 <= 30' file
10.1.0.10-15
10.1.0.20-25
10.1.0.30-35
Note that the , and . in the character classes are not needed; in fact, they match data that you don't want the pattern to match. Also, a . outside a character class matches any character (digit, letter, or . as you intend), so you need to escape it with a backslash to make it match only an actual dot.
Also, you are making Useless Use of cat (UUoC) errors; grep can perfectly well read from a file.
As to what to do, probably use:
grep -E '10\.1\.0\.([12][05]|30)' file
This uses extended regular expressions (formerly egrep, now grep -E). It also prevents the dots from matching any character.
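The effect of escaping the dots can be seen with a small experiment. The data below is hypothetical; the last line is a decoy that uses dashes where the address has dots.

```shell
# Hypothetical data: the last line is a decoy that uses dashes instead of dots.
tmpfile=$(mktemp)
printf '%s\n' '10.1.0.10-15' '10.1.0.30-35' '10.1.0.35-40' '10-1-0-20' > "$tmpfile"

loose=$(grep -cE '10.1.0.([12][05]|30)' "$tmpfile")       # unescaped dots
strict=$(grep -cE '10\.1\.0\.([12][05]|30)' "$tmpfile")   # literal dots
echo "loose=$loose strict=$strict"
rm -f "$tmpfile"
```

The unescaped pattern also counts the decoy line, while the escaped one counts only the two real addresses whose last octet is in the wanted set.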
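I'm not sure whether you just want to print the first two entries, excluding the one that ends in 35. In that case grep -E '10\.1\.0\.(10-15|20-25)' file does the job.
Remember that in extended regular expressions you can use alternation with | inside a group to help you; inside a bracket expression such as [15|25], the | is just a literal character.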

How to display the first word of each line in my file using the linux commands?

I have a file containing many lines, and I want to display only the first word of each line with the Linux commands.
How can I do that?
You can use awk:
awk '{print $1}' your_file
This will "print" the first column ($1) in your_file.
Try doing this using grep:
grep -Eo '^[^ ]+' file
Or try doing this with coreutils cut:
cut -d' ' -f1 file
I see there are already answers. But you can also do this with sed:
sed 's/ .*//' fileName
The above solutions seem to fit your specific case. For a more general application of your question, consider that words are generally defined as being separated by whitespace, but not necessarily space characters specifically. Columns in your file may be tab-separated, for example, or even separated by a mixture of tabs and spaces.
The previous examples are all useful for finding space-separated words, while only the awk example also finds words separated by other whitespace characters (and in fact this turns out to be rather difficult to do uniformly across various sed/grep versions). You may also want to explicitly skip empty lines, by amending the awk statement thus:
awk '{if ($1 !="") print $1}' your_file
If you are also concerned about lines that begin with whitespace, note that awk's default field splitting already ignores leading whitespace, so the awk commands above still print the first word, whereas the sed, grep and cut commands do not. If you prefer a script, a short Python version that does the trick might look like:
>>> for line in open('your_file'):
...     words = line.split()  # splits on any run of whitespace, ignoring leading whitespace
...     if words:             # skip empty and whitespace-only lines
...         print(words[0])
...or on Windows (if you have GnuWin32 grep) :
grep -Eo "^[^ ]+" file
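To see how the approaches above differ on input that is not purely space-separated, here is a small comparison. The sample input is invented: it contains a line with a leading tab, a tab-separated line and an empty line. awk 'NF { print $1 }' is used as a shorter way of skipping lines without any fields.

```shell
# Hypothetical input: a leading tab, a tab separator and an empty line.
tmpfile=$(mktemp)
printf 'alpha beta\n\tgamma delta\nepsilon\tzeta\n\n' > "$tmpfile"

# NF is 0 on empty or whitespace-only lines, so they are skipped.
first_words=$(awk 'NF { print $1 }' "$tmpfile")
printf '%s\n' "$first_words"

# For comparison: cut only understands the single space delimiter, so it
# keeps the leading tab and prints the tab-separated line unchanged.
cut -d' ' -f1 "$tmpfile"
rm -f "$tmpfile"
```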

Resources