Detect repeated characters using grep - linux

I'm trying to write a grep (or egrep) command that will find and print any lines in "words.txt" which contain the same lower-case letter three times in a row. The three occurrences of the letter may appear consecutively (as in "mooo") or separated by one or more spaces (as in "x x x") but not separated by any other characters.
words.txt contains:
The monster said "grrr"!
He lived in an igloo only in the winter.
He looked like an aardvark.
Here's what I think the command should look like:
grep -E '\b[^ ]*[[:alpha:]]{3}[^ ]*\b' 'words.txt'
I know this is wrong, but I don't know enough of the syntax to figure it out. Could someone please help me do this with grep?

Does this work for you?
grep '\([[:lower:]]\) *\1 *\1'
It matches a lowercase character, [[:lower:]], and remembers it with \( ... \). It then tries to match any number of spaces, " *" (zero included), then the remembered character \1, then any number of spaces again, then the remembered character once more. And that's it.
You can try running it with --color=auto to see what parts of the input it matched.
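A quick way to check the pattern is to recreate the three sample lines from the question and run it. This is only a minimal sketch: the scratch directory is an assumption made for illustration, and -o is added just to show which text was matched.

```shell
# Recreate the question's words.txt in a scratch directory (illustrative setup).
tmp=$(mktemp -d)
cat > "$tmp/words.txt" <<'EOF'
The monster said "grrr"!
He lived in an igloo only in the winter.
He looked like an aardvark.
EOF

# Print every line containing the same lowercase letter three times,
# optionally separated by spaces.
grep '\([[:lower:]]\) *\1 *\1' "$tmp/words.txt"

# -o prints only the matched text, which shows why each line was selected.
grep -o '\([[:lower:]]\) *\1 *\1' "$tmp/words.txt"

rm -r "$tmp"
```

The first two lines are selected: "grrr" contains rrr, and "igloo only" contains "oo o", which counts because the letters are separated only by a space. The aardvark line has just two consecutive a's, so it is not printed.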

Try this. Note that this will not match "mooo", as the word boundary (\b) occurs before the "m".
grep -E '\b([[:alpha:]]) *\1 *\1 *\b' words.txt
[:alpha:] is an expression of a character class. To use as a regex charset, it needs the extra brackets. You may have already known this, as it looks like you started to do it, but left the open bracket unclosed.

Related

inserting a number from stdout into a string from stdout

I'm working on a Linux terminal.
I have a string followed by a number as stdout and I need a command that replaces the middle of the string by the number and writes the result to stdout.
This is the string and number: librarian 16
and this is what the output should be: l16n
I have tried using echo librarian 16|sed s/[a-z]*/16/g and this gives me 9 999. The problems are that it replaces every letter separately, that it also replaces the first and last letters, and that I can't make it use the number from stdout.
I have also tried using cut -c 1-1, sed s/[^0-9]*//g and cut -c 9-9 to generate l, 16 and n respectively, but I can't find how to combine their outputs into a single line.
Lastly I have tried using text editors to copy the number and paste it into the string but I haven't made much progress since I don't know how to use editors directly from the command line.
So what you want is to capture the first letter, the last letter and the number while ignoring the middle.
In regex we use ( and ) to tell the engine what we want to capture, anything else simply gets matched, or "eaten", but not captured. So the pattern should look like this:
([a-z])[a-z]*([a-z]) ([0-9]+)
([a-z]) to capture the first letter
[a-z]* to match zero or more characters without capturing them. We choose "*" here because there might not be anything to match in the middle, as when the word has only two letters.
([a-z]) to capture the last letter.
" " (a literal space) to "eat" the whitespace.
([0-9]+) to capture the number. We use + instead of * because we require a number at this position.
sed uses a different syntax for some of these constructs by default, so we'll use the -E flag. You could do without it, but you'd have to escape the ()+ characters, which IMO makes the pattern a little confusing.
Now, to retrieve the captured content, we have to use an engine-specific sequence of characters. sed uses \n, where n is the number of the capturing group, so our replacement should look like this:
\1\3\2
\1: First letter
\3: Number
\2: Last letter
Now we put everything together:
$ echo librarian 16|sed -E 's/([a-z])[a-z]*([a-z]) ([0-9]+)/\1\3\2/g'
l16n
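For comparison, here is a sketch of the same substitution written without -E, where the groups have to be escaped. The second input line is a made-up example, and "one or more digits" is spelled [0-9][0-9]* on the assumption that the sed in use may not support \+ in basic regular expressions.

```shell
# Basic-regex version: groups are written \( \) instead of ( ).
# [0-9][0-9]* stands in for [0-9]+ so no extended syntax is needed.
printf '%s\n' 'librarian 16' 'internationalization 18' |
  sed 's/\([a-z]\)[a-z]*\([a-z]\) \([0-9][0-9]*\)/\1\3\2/'
```

This prints l16n and i18n. The greedy [a-z]* swallows the whole middle of the word and backs off by one character so that the second group can still capture the last letter.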

Vim or sed : Replace character(s) within a pattern

I want to replace underscores with hyphens wherever the underscore ('_') is preceded and followed by uppercase letters, e.g. QWQW_IOIO, OP_FD_GF_JK, TRT_JKJ, etc. The replacement is needed throughout one document.
I tried to replace this in vim using:
:%s/[A-Z]_[A-Z]/[A-Z]-[A-Z]/g
But that replaced QWQW_IOIO with QWQ[A-Z]-[A-Z]OIO :(
I tried using a sed command:
sed -i '/[A-Z]_[A-Z]/ s/_/-/g' ./file_name
This resulted in replacement over the whole line. e.g.
QWQW_IOIO variable may contain '_' or '-' line was replaced by
QWQW-IOIO variable may contain '-' or '-'
You had the right idea with your first vim approach. But you need to use a capturing group to remember what character was found in the [A-Z] section. Those are nicely explained here and under :h /\1. As a side note, I would recommend using \u instead of [A-Z], since it is both shorter and faster. That means the solution you want is:
:%s/\(\u\)_\(\u\)/\1-\2/g
Or, if you would like to use the "very magic" setting (\v) to make it more readable:
:%s/\v(\u)_(\u)/\1-\2/g
Another option would be to limit the part of the search that gets replaced with the \zs and \ze atoms:
:%s/\u\zs_\ze\u/-/g
This is the shortest solution I'm aware of.
This should do what you want, assuming GNU sed.
sed -i -r -e 's/([A-Z]+)_([A-Z]+)/\1-\2/g' ./file_name
Explanation:
-r flag enables extended regex
[A-Z]+ is "one or more uppercase letters"
() groups a pattern together and creates a numbered memorized match
\1, \2 put those memorized matches in the replacement.
So basically this finds a chunk of uppercase letters followed by an underscore, followed by another chunk of uppercase letters, and memorizes only the letter chunks as 2 groups:
([A-Z]+)_([A-Z]+)
Then it replays those groups, but with a hyphen in between instead of an underscore.
\1-\2
The g flag at the end says to do this even if the pattern shows up multiple times on one line.
Note that this falls apart a little in this case:
QWQW_IOIO_ABAB
Because it matches the first time, but not the second; the second part won't match because IOIO was consumed by the first match. So that would result in
QWQW-IOIO_ABAB
This version drops the + so it only matches one uppercase letter, and won't break in the same way:
sed -i -r -e 's/([A-Z])_([A-Z])/\1-\2/g'
It still has a small flaw, if you have a string like this:
A_B_C
Same issue as before, just with one letter now instead of several: the B is consumed by the first match, so the result is A-B_C.
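One possible way around the overlapping-match problem is to let sed repeat the substitution until nothing changes, using a label and the t (branch if substituted) command. This is a sketch rather than part of the answer above; the label name a and the sample inputs are illustrative, and it assumes a sed that accepts -E.

```shell
# With g alone, the shared letter is consumed by the first match,
# so the second underscore is left untouched.
echo 'A_B_C' | sed -E 's/([A-Z])_([A-Z])/\1-\2/g'

# Loop: replace one underscore, then branch back to label "a" for as long as
# a substitution succeeded (t), so overlapping matches are all handled.
printf '%s\n' 'A_B_C' 'QWQW_IOIO_ABAB' 'snake_case' |
  sed -E -e ':a' -e 's/([A-Z])_([A-Z])/\1-\2/' -e 'ta'
```

The first command prints A-B_C. The loop prints A-B-C and QWQW-IOIO-ABAB, and leaves snake_case alone because its underscore is not between uppercase letters.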

Linux command for search substring

I want to find the word 'on' as a prefix or suffix of a string, but not where it is in the middle.
As an example,
I have a text which has words like 'on', 'one', 'cron', 'stone'. I want to find lines which contain the exact word 'on' and also words like 'one' and 'cron', but it should not match 'stone'.
I'm surprised nobody has proposed the simple, obvious
grep -E '\<on|on\>' files ...
The metacharacter sequences \< and \> match a left and right word boundary, respectively. I believe it should be portable to any modern platform (though I would be unsurprised if Solaris, HP-UX, or AIX required some tweaks in order to get it to work).
If you've got GNU grep or BSD grep, then it is relatively straight-forward:
grep -E '\b(on[[:alpha:]]*|[[:alpha:]]*on)\b'
This looks for a word boundary followed by 'on' and zero or more alphabetic characters, or for zero or more alphabetic characters followed by 'on', followed by a word boundary.
For example, given the data:
on line should be selected
cron line should be selected
stone line should not be selected
station wagon
onwards, ever onwards.
on24 is not selected
24on is not selected
Example run:
$ grep -E '\b(on[[:alpha:]]*|[[:alpha:]]*on)\b' data
on line should be selected
cron line should be selected
station wagon
onwards, ever onwards.
$
With a strict POSIX-compatible grep, you would have to work a lot harder, if it can be done at all.
Note that this solution is assuming that mixed digits and letters are not a 'word' in this context (so neither on24 nor 24on should be selected). If you don't mind digits appearing as part of a word starting or ending 'on', then you can use either of two other answers:
triplee's answer
alfasin's answer
or you can hack this one into shape so it does what one of theirs does.
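To see the difference concretely, the sketch below runs the plain word-boundary pattern on the sample data and then filters out everything the letters-only pattern also selects. It assumes GNU grep (for \b, \< and \>), and the scratch directory is only there for illustration.

```shell
# Recreate the sample data in a scratch directory (illustrative setup).
tmp=$(mktemp -d)
cat > "$tmp/data" <<'EOF'
on line should be selected
cron line should be selected
stone line should not be selected
station wagon
onwards, ever onwards.
on24 is not selected
24on is not selected
EOF

# Lines selected by the word-boundary pattern but rejected by the
# letters-only pattern: exactly the ones that mix digits and letters.
grep -E '\<on|on\>' "$tmp/data" |
  grep -Ev '\b(on[[:alpha:]]*|[[:alpha:]]*on)\b'

rm -r "$tmp"
```

Only the on24 and 24on lines are printed, so the two patterns agree on everything else in this data set.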
You can use egrep (regex) in order to catch the exact phrases: by using \b (word boundary) you can make sure to not catch anything else other than the required 3 words:
egrep -e '\b(on|one|cron)\b' <filename>
UPDATE:
Since the question was edited & clarified that the OP is looking to have on "as a prefix or suffix of a string":
egrep -e '\bon|on\b' <filename>
If you're just going 'all out' and searching for anything with the substring 'on' in it (leaving out 'stone')...
grep 'on' <your file name> | grep -v 'stone'
Piping into grep -v hides any of the resulting lines that contain 'stone'.

How to make a Palindrome with a sed command?

I'm trying to find a command that prints all palindromes in a dictionary file.
This is what I have at the moment, which is wrong:
sed -rn '/^([a-z])-([a-z])\2\1$/p' /usr/share/dict/words
Can somebody explain the code as well.
Found the right answer.
sed -n '/^\([a-z]\)\([a-z]\)\2\1$/p' /usr/share/dict/words
I have no idea why I used -.
I also don't have an explanation for the \ around each group.
You can use the grep command as explained here:
grep -w '^\(.\)\(.\).\2\1'
Explanation: the pattern captures the first two characters with \(.\)\(.\), matches any third character with ., and then checks whether the second and first captured characters occur again in reverse order (\2\1).
The above grep command will only find five-letter palindromes.
An extended version is also proposed on that page. It works correctly for the first line but then crashes; there is surely something worth keeping and perhaps adapting:
Guglielmo Bondioni proposed a single RE that finds all palindromes up to 19 characters long using 9 subexpressions and 9 back-references:
grep -E -e '^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1$' file
You can extend this further as much as you want :)
Perl to the rescue:
perl -lne 'print if $_ eq reverse' /usr/share/dict/words
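The same compare-with-the-reverse idea can be sketched in plain shell. This assumes the rev utility is installed (it ships with util-linux on most Linux systems); the word list is made up for illustration, and a real run would read /usr/share/dict/words instead.

```shell
# Print each word that is equal to its own reversal.
printf '%s\n' level noon stone racecar |
  while IFS= read -r word; do
    # rev reverses the characters of each input line.
    if [ "$word" = "$(printf '%s\n' "$word" | rev)" ]; then
      printf '%s\n' "$word"
    fi
  done
```

This starts one rev process per word, so it is far slower than the Perl one-liner on a full dictionary, but it has no length limit.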
Hate to say it, but while regex may be able to cook your breakfast, I don't think it can find a palindrome. According to the all-knowing Wikipedia:
In the automata theory, a set of all palindromes in a given alphabet is a typical example of a language that is context-free, but not regular. This means that it is impossible for a computer with a finite amount of memory to reliably test for palindromes. (For practical purposes with modern computers, this limitation would apply only to incredibly long letter-sequences.)
In addition, the set of palindromes may not be reliably tested by a deterministic pushdown automaton which also means that they are not LR(k)-parsable or LL(k)-parsable. When reading a palindrome from left-to-right, it is, in essence, impossible to locate the "middle" until the entire word has been read completely.
So a regular expression won't be able to solve the problem based on the problem's nature, but a computer program (or sed examples like #NeronLeVelu or #potong) will work.
explanation of your code
sed -rn '/^([a-z])-([a-z])\2\1$/p' /usr/share/dict/words
It selects and prints lines that consist of:
a lowercase letter at the start of the line, followed by -, followed by another lowercase letter (which could be the same as the first), followed by the second group's letter again (\2), followed by the first group's letter again (\1), and then nothing else (end of line). In other words: Letter1-Letter2Letter2Letter1.
sample:
a-bba
a is first letter
b second letter
b is \2
a is \1
But that is a strange pattern for any real work, unless the input comes from a very specific dictionary (one limited to such combinations, for example).
This might work for you (GNU sed):
sed -r 'h;s/[^[:alpha:]]//g;H;x;s/\n/&&/;ta;:a;s/\n(.*)\n(.)/\n\2\1\n/;ta;G;/\n(.*)\n\n\1$/IP;d' file
This copies the original string(s) to the hold space (HS), then removes everything but alpha characters from the string(s) and appends this to the HS. The second copy is then reversed and the current string(s) and the reversed copy compared. If the two strings are equal then the original string(s) is printed out otherwise the line is deleted.

Substitute `number` with `(number)` in multiple lines

I am a beginner at Vim and I've been reading about substitution but I haven't found an answer to this question.
Let's say I have some numbers in a file like so:
1
2
3
And I want to get:
(1)
(2)
(3)
I think the command should resemble something like :s:\d\+:........ Also, what's the difference between :s/foo/bar and :s:foo:bar ?
Thanks
Here is an alternative, slightly less verbose, solution:
:%s/^\d\+/(&)
Explanation:
^ anchors the pattern to the beginning of the line
\d is the atom that covers 0123456789
\+ matches one or more of the preceding item
& is a shorthand for \0, the whole match
Let me address those in reverse.
The second question first: there's no difference between :s/foo/bar and :s:foo:bar; whatever delimiter you use after the s, vim will expect you to use from then on. This can be nice if you have a substitution involving lots of slashes, for instance.
As for the first question: to do this to the first number on the current line (assuming no commas, decimal places, etc), you could do
:s:\(\d\+\):(\1)
The \(...\) doesn't change what is matched - rather, it tells vim to remember whatever matched what is inside, and store it. The first \(...\) is stored in \1, the second in \2, etc. So, when you do the replacement, you can reference \1 to get the number back.
If you want to change ALL numbers on the current line, change it to
:s:\(\d\+\):(\1):g
If you want to change ALL numbers on ALL lines, change it to
:%s:\(\d\+\):(\1):g
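The delimiter choice and the effect of the g flag can also be tried from a shell, since sed's s command follows the same conventions. This is a small sketch, assuming a sed that supports -E (GNU and BSD sed do); the sample line is made up, and & stands for the whole match.

```shell
# Same substitution with two different delimiters: the output is identical.
echo '1 22 333' | sed -E 's/[0-9]+/(&)/'
echo '1 22 333' | sed -E 's:[0-9]+:(&):'

# Adding g wraps every number on the line, not just the first.
echo '1 22 333' | sed -E 's:[0-9]+:(&):g'
```

The first two commands both print (1) 22 333, and the last prints (1) (22) (333).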
You can do what you want with:
:%s/\([0-9]\+\)/(\1)/
% applies the substitution to every line in the file. The \( \) defines a group, which is then referenced by \1. So the above search and replace finds the first run of digits ([0-9]\+) on each line and replaces it with the matched number surrounded by parentheses.

Resources