grep perfect string match matches dash in the prefix - string

I am trying to match the rows where there is just "glasses" or something like "blabla,glasses,blabla", using grep.
for example, in this file:
1 this-is-fake-glasses
2 glasses
3 abc,glasses
4 glasses,abc
5 abc,glasses,abc
my expected output includes just the last two lines.
Following what I tried
grep -w "glasses$" t.txt # matches first and last
grep -o "glasses" t.xt # matches all
grep -E '(^|\s)glasses($|\s)' t matches just last

According to your constraints, the following output only last two rows..
grep -v "glasses$" t.txt
As you don't want glasses to the ending word in the line?

Related

How can I find the number of 8 letter words that do not contain the letter "e", using the grep command?

I want to find the number of 8 letter words that do not contain the letter "e" in a number of text files (*.txt). In the process I ran into two issues: my lack of understanding in quantifiers and how to exclude characters.
I'm quite new to the Unix terminal, but this is what I have tried:
cat *.txt | grep -Eo "\w+" | grep -i ".*[^e].*"
I need to include the cat command because it otherwise includes the names of the text files in the pipe. The second pipe is to have all the words in a list, and it works, but the last pipe was meant to find all the words that do not have the letter "e" in them, but doesn't seem to work. (I thought "." for no or any number of any character, followed by a character that is not an "e", and followed by another "." for no or any number of any character.)
cat *.txt | grep -Eo "\w+" | grep -wi "[a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]"
This command works to find the words that contain 8 characters, but it is quite ineffective, because I have to repeat "[a-z]" 8 times. I thought it could also be "[a-z]{8}", but that doesn't seem to work.
cat *.txt | grep -Eo "\w+" | grep -wi "[a-z][a-z][a-z][a-z][a-z][a-z][a-z][a-z]" | grep -i ".*[^e].*"
So finally, this would be my best guess, however, the third pipe is ineffective and the last pipe doesn't work.
You may use this grep:
grep -hEiwo '[a-df-z]{8}' *.txt
Here:
[a-df-z]{8}: Matches all letters except e
-h: Don't print filename in output
-i: Ignore case search
-o: Print matches only
-w: Match complete words
In case you are ok with GNU awk and assuming that you want to print only the exact words and could be multiple matches in a line if this is the case one could try following.
awk -v IGNORECASE="1" '{for(i=1;i<=NF;i++){if($i~/^[a-df-z]{8}$/){print $i}}}' *.txt
OR without the use of IGNORCASE one could try:
awk '{for(i=1;i<=NF;i++){if(tolower($i)~/^[a-df-z]{8}$/){print $i}}}' *.txt
NOTE: Considering that you want exact matches of 8 letters only in lines. 8 letter words followed by a punctuation mark will be excluded.
Here is a crazy thought with GNU awk:
awk 'BEGIN{FPAT="\\<\\w{8}\\>"}{c+=NF}END{print c}' file
Or if you want to make it work only on a select set of characters:
awk 'BEGIN{FPAT="\\<[a-df-z]{8}\\>"}{c+=NF}END{print c}' file
What this does is, it defines the fields, to be a set of 8 characters (\w as a word-constituent or [a-df-z] as a selected set) which is enclosed by word-boundaries (\< and \>). This is done with FPAT (note the Gory details about escaping).
Sometimes you might also have words which contain diatrics, so you have to expand. Then this might be the best solution:
awk 'BEGIN{FPAT="\\<\\w{8}\\>"}{for(i=1;i<=NF;++i) if($i !~ /e/) c++}END{print c}' file

How to extract a specific text from gz file?

I need to extract the 5 to 11 characters from my fastq.gz data this data is just too large for running in R. So I was wondering if I can do it directly in Linux command line?
The fastq file looks like this:
#NB501399:67:HFKTCBGX5:1:11101:13202:1044 1:N:0:CTTGTA
GAGGTNACGGAGTGGGTGTGTGCAGGGCCTGGTGGGAATGGGGAGACCCGTGGACAGAGCTTGTTAGAGTGTCCTAGAGCCAGGGGGAACTCCAGGCAGGGCAAATTGGGCCCTGGATGTTGAGAAGCTGGGTAACAAGTACTGAGAGAAC
+
AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAAEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE6
#NB501399:67:HFKTCBGX5:1:11101:1109:1044 1:N:0:CTTGTA
TAGGCNACCTGGTGGTCCCCCGCTCCCGGGAGGTCACCATATTGATGCCGAACTTAGTGCGGACACCCGATCGGCATAGCGCACTACAGCCCAGAACTCCTGGACTCAAGCGATCCTCCAGCCTCAGCCTCCCGAGTAGCTGGGACTACAG
+
And I only want to extract the 5 to 11 character which located in sequence part (for the first one is TNACGG, for the second is CNACCT) and makes it a new txt file. Can I do that?
You can use GNU sed with zcat:
zcat fastq.gz | sed -n '2~5{s/.\{4\}\(.\{6\}\).*/\1/;p}'
-n means lines are not printed by default
2~5 means start with line 2, match every fifth line
when the "address" matches, the substitution remembers the fifth to tenth character in \1 and replaces the whole line with it, p prints the result
Another using zgrep and positive lookbehind:
$ zgrep -oP "(?<=^[ACTGN]{4})[ACTGN]{6}" foo.gz
TNACGG
CNACCT
Explained:
zgrep : man zgrep: search possibly compressed files for a regular expression
-o Print only the matched (non-empty) parts of a matching line
-P Interpret the pattern as a Perl-compatible regular expression (PCRE).
(?<=^[ACTGN]{4}) positive lookbehind
[ACTGN]{6} match 6 named characters that are preceeded by above
foo.gz my test file
$ zcat fastq.gz | awk '(NR%5)==2{print substr($0,5,6)}'
TNACGG
CNACCT

How can I get a whole word with grep? [duplicate]

This question already has answers here:
Display exact matches only with grep [duplicate]
(9 answers)
How to make grep only match if the entire line matches?
(12 answers)
Closed 5 years ago.
I want to get the lines that contain a defined word with grep.
Edit: The solution was this one.
I know that I can use the -w option, but it doesn't seems to do the trick.
For example: every word that contains my defined word separated by punctuation signs is included. If I look for dogs, it will show me lines that contain not only dogs word but also cats.dogs, cats-dogs, etc.
# cat file.txt
some alphadogs dance
some cats-dogs play
none dogs dance
few dog sing
all cats.dogs shout
And with grep:
# cat file.txt | grep -w "dogs"
some cats-dogs play
none dogs dance
all cats.dogs shout
Desired output:
# cat file.txt | grep -w "dogs"
none dogs dance
Do you know any workaround that allows you to get the whole word? I've tested it with \b or \< with negative results.
Thanks,
Eudald
Use the word-boundary anchors, in any version of grep you have installed
grep '^dogs$' file.txt
An excerpt from this regular-expressions page,
Anchors
[..] Anchors do not match any characters. They match a position. ^ matches at the start of the string, and $ matches at the end of the string.[..]
Try with -x parameter:
grep -x dogs file.txt
From grep manual:
-x, --line-regexp
Select only those matches that exactly match the whole line. (-x is specified by POSIX.)
NOTE: cat is useless when you pipe its output to grep

Using grep to get 12 letter alphabet only lines

Using grep
How many 12 letter - alphabet only lines are in testing.txt?
excerpt of testing.txt
tyler1
Tanktop_Paedo
xyz2#geocities.com
milt#uole.com
justincrump
cranges10
namer#uole.com
soulfunkbrotha
timetolearnz
hotbooby#geocities.com
Fire_Crazy
helloworldad
dingbat#geocities.com
from this excerpt, I want to get a result of 2. (helloworldad, and timetolearnz)
I want to check every line and grep only those that have 12 characters in each line. I can't think of a way to do this with grep though.
For the alphabet only, I think I can use
grep [A-Za-z] testing.txt
However, how do I make it so only the characters [A-Za-z] show up in those 12 characters?
You can do it with extended regex -E and by specifying that the match is exactly {12} characters from start ^ to finish $
$ grep -E "^[A-Za-z]{12}$" testing.txt
timetolearnz
helloworldad
Or if you want to get the count -c of the lines you can use
$ grep -cE "^[A-Za-z]{12}$" testing.txt
2
grep supports whole-line match and counting, e.g.:
grep -xc '[[:alpha:]]\{12\}' testing.txt
Output:
2
The [:alpha:] character class is another way of saying [A-Za-z]. See section 3.2 of the the info pages: info grep 'Regular Expressions' 'Character Classes and Bracket Expressions' for more on this subject. Or look it up in the pdf manual online.

Find the number of occurences of certain string sequences

I want to count the number of occurences of the IP address 192.168.1.10 in a text file using grep | wc.
The command I use is:
cat ./capture.txt|grep "192.168.1.10"|wc -w
which returns 0, and I don't know why.
Here is the content of my .txt file:
give this a try:
grep -Fwo '192.168.1.10' file|wc -l
-F makes the grep take your pattern as literal string instead of regex
-w excludes 192.168.1.101 or 192.168.1.100
-o lists each match in a line. grep does line based match, if your pattern matched twice in a line, the result of occurrence count may be wrong.
cat ./capture.txt | grep "\b192\.168\.1\.10\b" -c
\. search for dot, not any character
\b match at the beginning or end of a word
-c return the number of occurrences

Resources