how to egrep after using egrep -o [duplicate] - linux

I have a file called random.html with the following line (not the only line):
blahblahblahblah random="whatever h45" blahblahblahblah
I want to get only whatever; so far I used the following:
egrep -o 'random="([a-z]*[A-Z]*[0-9]*[ ]*)+'
This gives me random="whatever h45
I can't use just egrep -o '="([a-z]*[A-Z]*[0-9]*[ ]*)+' to begin with, because this is not my only line and there will be unwanted lines; the random keyword is important for distinction purposes. I tried to do a double egrep -o such as:
egrep -o 'random="([a-z]*[A-Z]*[0-9]*[ ]*)+' | egrep -o '="([a-z]*[A-Z]*[0-9]*[ ]*)+'
Where it would just display ="whatever h45, but that doesn't work. Am I doing something wrong, or is this illegal? I don't want to use anything fancy or use cut. This is supposed to be very "basic".

You can do this in bash alone as well:
while read -r; do
    [[ $REPLY =~ random=\"([a-zA-Z0-9]+) ]] || continue
    echo "${BASH_REMATCH[1]}"
done < file.txt
If your version of grep supports Perl regexes, you can use lookbehind assertions to match only text that follows random=".
grep -P -o '(?<=random=\")([a-zA-Z0-9]+)' file.txt
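If grep -P isn't available, a plain sed substitution achieves the same extraction with POSIX tools; a minimal sketch, assuming one random="..." attribute per line:
sed -n 's/.*random="\([a-zA-Z0-9]*\).*/\1/p' file.txt
The -n plus the trailing p prints only lines where the substitution matched, so unwanted lines are skipped just as with grep.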

You're just using the wrong tool; this is trivial in awk. There are various solutions, here's one:
$ cat file
blahblahblahblah random="whatever h45" blahblahblahblah
$ awk 'match($0,/random="([a-z]*[A-Z]*[0-9]*[ ]*)+/) { print substr($0,RSTART+8,RLENGTH-8) }' file
whatever h45
It wasn't clear from your question if you wanted whatever or whatever h45 or ="whatever h45 or some other part of the string printed, so I just picked the one I thought most likely. Whichever it is, it's trivial...
By the way, your regexp doesn't seem to make sense; I just copied it from your question to ease the contrast between what you had and the awk solution. If you tell us in words what it's meant to represent, we can write it correctly for you, but I THINK the most likely thing is that it should just be non-double-quote, e.g.:
$ awk 'match($0,/random="[^"]+/) { print substr($0,RSTART+8,RLENGTH-8) }' file
whatever h45
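As an aside, GNU awk's three-argument match() fills an array with the capture groups, which avoids the substr() arithmetic; a sketch, assuming gawk specifically:
$ gawk 'match($0,/random="([^"]+)/,m) { print m[1] }' file
whatever h45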

A Perl solution, for completeness:
#% perl -n -e 'print $1, "\n" if m!random="(\S+)!' tt
gives
whatever
whatever
where tt is
#% cat tt
blahblahblahblah random="whatever h45" blahblahblahblah
blahblahblahblah random="whatever h45" blahblahblahblah

Related

How to find words from one file in another file?

In one text file, I have 150 words. I have another text file, which has about 100,000 lines.
How can I check, for each of the words in the first file, whether it is in the second or not?
I thought about using grep, but I could not find out how to use it to read each of the words in the original text.
Is there any way to do this using awk? Or another solution?
I tried with this shell script, but it matches almost every line:
#!/usr/bin/env sh
cat words.txt | while read line; do
    if grep -F "$FILENAME" text.txt
    then
        echo "Found $line"
    fi
done
Another way I found is:
fgrep -w -o -f "words.txt" "text.txt"
You can use grep -f:
grep -Ff "first-file" "second-file"
Or, to match full words:
grep -w -Ff "first-file" "second-file"
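To see why -w matters, here is a quick demonstration with two hypothetical files:
$ cat words.txt
cat
$ cat text.txt
the cat sat
catalogue entry
$ grep -Ff words.txt text.txt
the cat sat
catalogue entry
$ grep -w -Ff words.txt text.txt
the cat sat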
UPDATE: As per the comments:
awk 'FNR==NR{a[$1]; next} ($1 in a){delete a[$1]; print $1}' file1 file2
Use grep like this:
grep -f firstfile secondfile
SECOND OPTION
Thank you to Ed Morton for pointing out that the words in the file "reserved" are treated as patterns. If that is an issue - it may or may not be - the OP can use something like the following, which doesn't use patterns:
File "reserved"
cat
dog
fox
and file "text"
The cat jumped over the lazy
fox but didn't land on the
moon at all.
However it did land on the dog!!!
Awk script is like this:
awk 'BEGIN{i=0} FNR==NR{res[i++]=$1; next} {for(j=0;j<i;j++) if(index($0,res[j])) print $0}' reserved text
with output:
The cat jumped over the lazy
fox but didn't land on the
However it did land on the dog!!!
THIRD OPTION
Alternatively, it can be done quite simply, but more slowly, in bash:
while read -r r; do grep "$r" secondfile; done < firstfile

How to check if the sed command replaced some string? [duplicate]

This command replaces the old string with the new one, if it exists:
sed "s/$OLD/$NEW/g" "$source_filename" > "$dest_filename"
How can I check whether the replacement happened? (Or how many times it happened?)
sed is not the right tool if you need to count the substitutions; awk will fit your needs better:
awk -v OLD=foo -v NEW=bar '
($0 ~ OLD) {gsub(OLD, NEW); count++}1
END{print count " substitutions occurred."}
' "$source_filename"
This solution counts only the number of lines where a substitution occurred. The next snippet counts all substitutions, with Perl. It has the advantage of being clearer than awk, and it keeps the syntax of the sed substitution:
OLD=foo NEW=bar perl -pe '
$count += s/$ENV{OLD}/$ENV{NEW}/g;
END{print "$count substitutions occured.\n"}
' "$source_filename"
Edit: Thanks to william, who found the $count += s///g trick to count the number of substitutions (whether or not they fall on the same line).
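A quick demonstration with a hypothetical input file:
$ cat file
foo and foo
foo
$ OLD=foo NEW=bar perl -pe '$count += s/$ENV{OLD}/$ENV{NEW}/g; END{print "$count substitutions occurred.\n"}' file
bar and bar
bar
3 substitutions occurred.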
This awk should count the total number of substitutions instead of the number of lines where substitutions took place:
awk 'END{print t, "substitutions"} {t+=gsub(old,new)}1' old="foo" new="bar" file
If you are free to choose another tool, like awk (as #sputnick suggested), go with it; awk can count how many times the pattern matched.
sed itself cannot count replacements, particularly if you use the /g flag. However, if you want to stick with sed and still know the number of replacements, there are possibilities:
One way is
grep -o 'pattern' oldfile | wc -l && sed 's/pattern/rep/g' oldfile > newfile
You could also do it with tee:
cat file|tee >(grep -o 'pattern'|wc -l)|(sed 's/pattern/replace/g' >newfile)
see this small example:
kent$ cat file
abababababa
aaaaaa
xaxaxa
kent$ cat file|tee >(grep -o 'a'|wc -l)|(sed 's/a/-/g' >newfile)
15
kent$ cat newfile
-b-b-b-b-b-
------
x-x-x-
This worked for me:
awk -v s="OLD" -v c="NEW" '{count+=gsub(s,c)}1
END{print count " substitutions"}
' opfilename
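If you only need to know whether any replacement happened at all, not how many, a grep -q guard in front of sed is enough. A minimal sketch, using the variables from the question (note that $OLD is treated as a regular expression here, just as sed treats it):
if grep -q "$OLD" "$source_filename"; then
    sed "s/$OLD/$NEW/g" "$source_filename" > "$dest_filename"
    echo "at least one substitution was made"
else
    echo "nothing to replace"
fi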

Grep Syntax with Capitals

I'm trying to write a script, with a file as an argument, that greps the text file to find any word that starts with a capital and has 8 letters following it. I'm bad with syntax, so I'll show you my code; I'm sure it's an easy fix.
grep -o '[A-Z][^ ]*' $1
I'm not sure how to specify that:
a) it starts with a capital letter, and
b) that it's a 9-letter word.
Cheers
EDIT:
As an edit I'd like to add my new code:
while read p
do
    echo $p | grep -Eo '^[A-Z][[:alpha:]]{8}'
done < $1
I still can't get it to work; any help on my new code?
'[A-Z][^ ]*' will match one character between A and Z, followed by zero or more non-space characters. So it would match any A-Z character on its own.
Use \b to indicate a word boundary, and a quantifier inside braces, for example:
grep '\b[A-Z][a-z]\{8\}\b'
If you just did grep '[A-Z][a-z]\{8\}' that would match (for example) "aaaaHellosailor".
I used \{8\}; the braces need to be escaped unless you use grep -E, also known as egrep, which uses Extended Regular Expressions. Vanilla grep, which you are using, uses Basic Regular Expressions. Also note that \b is not part of the POSIX standard, but it is commonly supported.
If you use ^ at the beginning and $ at the end, then it will not find "Wiltshire" in "A Wiltshire pig makes great sausages"; it will only find lines that consist of a 9-character word and nothing else.
This works for me:
$ echo "one-Abcdefgh.foo" | grep -o -E '[A-Z][[:alpha:]]{8}'
$ echo "one-Abcdefghi.foo" | grep -o -E '[A-Z][[:alpha:]]{8}'
Abcdefghi
$
Note that this doesn't handle extensions or prefixes. If you want to FORCE the input to be a 9-letter capitalized word, we need to be more explicit:
$ echo "one-Abcdefghij.foo" | grep -o -E '\b[A-Z][[:alpha:]]{8}\b'
$ echo "Abcdefghij" | grep -o -E '\b[A-Z][[:alpha:]]{8}\b'
$ echo "Abcdefghi" | grep -o -E '\b[A-Z][[:alpha:]]{8}\b'
Abcdefghi
$
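Wrapped up as the script the question asks for (taking the file as its first argument), a minimal sketch might be:
#!/bin/bash
# Print every capitalized 9-letter word found in the file given as $1.
grep -oE '\b[A-Z][[:alpha:]]{8}\b' "$1"
There is no need for a while read loop here: grep already processes the file line by line, and dropping the ^ anchor finds such words anywhere on a line, not just at the start.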
I have a test file named 'testfile' with the following content:
Aabcdefgh
Babcdefgh
cabcdefgh
eabcd
Now you can use the following command to grep in this file:
grep -Eo '^[A-Z][[:alpha:]]{8}' testfile
The command above is equivalent to:
cat testfile | grep -Eo '^[A-Z][[:alpha:]]{8}'
This matches
Aabcdefgh
Babcdefgh

Bash Sorting STDIN

I want to write a bash script that sorts the input by rules into different files. The first rule is to write all chars or strings to file1. The second rule is to write all numbers to file2. The third rule is to write all alphanumerical strings to file3. All special chars must be ignored. Because I am not familiar with bash, I don't know how to realize this.
Could someone help me?
Thanks,
Haniball
Thanks for the answers,
I wrote this script:
#!/bin/bash
inp=0
echo "Which filename for strings?"
read strg
touch $strg
echo "Which filename for nums?"
read nums
touch $nums
echo "Which filename for alphanumerics?"
read alphanums
touch $alphanums
while [ "$inp" != "quit" ]
do
echo "Input: "
read inp
echo $inp | grep -o '\<[a-zA-Z]+>' > $strg
echo $inp | grep -o '\<[0-9]>' > $nums
echo $inp | grep -o -E '\<[0-9]{2,}>' > $nums
done
After I ran it, it only writes strings to the string file.
Greetings, Haniball
Sure can help. See here:
How To Ask Questions The Smart Way
Help Vampires: A Spotter’s Guide
A cool site about bash is here: http://wiki.bash-hackers.org/doku.php
For sorting, try man sort.
For pattern matching, try man grep.
Other useful tools: man sed, man awk, man strings, man tee.
And it is always correct to tag your homework as "homework" ;)
You can try something like:
<input_file strings -1 -a | tee chars_and_strings.txt |\
grep "^[A-Za-z0-9][A-Za-z0-9]*$" | tee alphanum.txt |\
grep "^[0-9][0-9]*$" > numonly.txt
The above is ASCII-only; with international (read: Unicode) chars, things get a little more complicated.
grep is sufficient (your question is a bit vague. If I got something wrong, let me know...)
Using the following input file:
this is a string containing words,
single digits as in 1 and 2 as well
as whole numbers 42 1066
All chars or strings:
$ grep -o '\<[a-zA-Z]\+\>' sorting_input
this
is
a
string
containing
words
single
digits
as
in
and
as
well
All single-digit numbers:
$ grep -o '\<[0-9]\>' sorting_input
1
2
All multi-digit numbers:
$ grep -o -E '\<[0-9]{2,}\>' sorting_input
42
1066
Redirect the output to a file, e.g. grep ... > file1
Bash really isn't the best language for this kind of task. While it's possible, I'd highly recommend using perl, python, or tcl for this.
That said, you can write all of stdin to a temporary file with shell redirection, then use a command like grep to output matches to another file. It might look something like this:
#!/bin/bash
cat > temp
grep pattern1 temp > file1
grep pattern2 temp > file2
grep pattern3 temp > file3
rm -f temp
Then run it like this:
cat file_to_process | ./script.sh
I'll leave the specifics of the pattern matching to you.
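For the three rules in the question, the patterns might look something like the following sketch; the temporary file and the output file names are assumed from the snippet above, and the word-boundary operators \< and \> are GNU extensions:
#!/bin/bash
cat > temp
grep -oE '\<[[:alpha:]]+\>' temp > file1    # pure letter strings
grep -oE '\<[[:digit:]]+\>' temp > file2    # pure numbers
grep -oE '\<[[:alnum:]]+\>' temp | grep '[[:alpha:]]' | grep '[[:digit:]]' > file3    # mixed alphanumerics
rm -f temp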

Linux using grep to print the file name and first n characters

How do I use grep to perform a search which, when a match is found, will print the file name as well as the first n characters in that file? Note that n is a parameter that can be specified and it is irrelevant whether the first n characters actually contains the matching string.
grep -l pattern *.txt |
while read line; do
    echo -n "$line: "
    head -c $n "$line"
    echo
done
Change -c to -n if you want to see the first n lines instead of bytes.
You need to pipe the output of grep to sed to accomplish what you want. Here is an example:
grep mypattern *.txt | sed 's/^\([^:]*:.......\).*/\1/'
The number of dots is the number of characters you want to print. Many versions of sed provide an option, like -r (GNU/Linux) and -E (FreeBSD), that allows you to use modern-style regular expressions; this makes it possible to specify numerically the number of characters you want to print:
N=7
grep mypattern *.txt /dev/null | sed -r "s/^([^:]*:.{$N}).*/\1/"
Note that this solution is a lot more efficient than the others proposed, which invoke multiple processes.
There are few tools that print 'n characters' rather than 'n lines'. Are you sure you really want characters and not lines? The whole thing can perhaps be best done in Perl. As specified (using grep), we can do:
pattern="$1"
shift
n="$2"
shift
grep -l "$pattern" "$#" |
while read file
do
echo "$file:" $(dd if="$file" count=${n}c)
done
The quotes around $file preserve multiple spaces in file names correctly. We can debate the command line usage, currently (assuming the command name is 'ngrep'):
ngrep pattern n [file ...]
I note that #litb used 'head -c $n'; that's neater than the dd command I used. There might be some systems without head (but they'd be pretty archaic). I note that the POSIX version of head only supports -n and a number of lines; the -c option is probably a GNU extension.
Two thoughts here:
1) If efficiency were not a concern (like that would ever happen), you could check $status [csh] after running grep on each file. E.g. (for N characters = 25):
foreach FILE ( file1 file2 ... fileN )
    grep targetToMatch ${FILE} > /dev/null
    if ( $status == 0 ) then
        echo -n "${FILE}: "
        head -c25 ${FILE}
    endif
end
2) GNU [FSF] head has a --verbose [-v] switch. GNU grep and xargs also offer --null, to accommodate filenames with spaces. And there's '--', to handle filenames like "-c". So you could do:
grep --null -l targetToMatch -- file1 file2 ... fileN |
xargs --null head -v -c25 --
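head -v prints a ==> filename <== header before each file's bytes, so with two hypothetical matching files the output would look something like:
==> file1 <==
The quick brown fox jumps
==> file2 <==
Lorem ipsum dolor sit ame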
