grep - exclude certain domains from url search - string

data_file.txt contains URLs, something like:
bunch of data http://good1.com/contact
lines of non-url data
bunch of data http://ok.ip.add.rss/page/1
lines of non-url data
bunch of data http://spammer.com/spammers/are/lame
lines of non-url data
bunch of data http://good2.com/page2
lines of non-url data
bunch of data http://good1.com/contact
Some are good URLs, some are spammer URLs. I'm trying to find all the spammer URLs.
I can find the good URLs with:
grep -n -o -P 'http://(good1.com|ok.ip.add.rss|good2.com).{0,80}' data_file.txt
I would like to reverse that, finding anything that is not good. I tried these variants:
grep -n -o -P 'http://*(^(good1.com|ok.ip.add.rss|good2.com)).{0,80}' data_file.txt
grep -n -o -P 'http://*^(good1.com|ok.ip.add.rss|good2.com).{0,80}' data_file.txt
grep -n -o -P 'http://*(^good1.com|^ok.ip.add.rss|^good2.com).{0,80}' data_file.txt
grep -n -o -P 'http://*(^good1.com\|^ok.ip.add.rss\|^good2.com).{0,80}' data_file.txt
grep -n -o -P 'http://*(^(good1.com|ok.ip.add.rss|good2.com)).{0,80}' data_file.txt
...but those didn't work. Any ideas?

I was able to do this with a double grep:
grep -n -o -P "http://.*?[^/'\\\\)<]*" data_file.txt | grep -v "http://good1.com\|http://good2.com\|http://ok.ip.add.rss"
There were various characters besides slashes following the domains, hence the [^/'\\\\)<] at the end.
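A single-pass alternative to the double grep (a sketch, assuming GNU grep built with PCRE support, i.e. the same -P used above) is a negative lookahead right after http://, so the match fails whenever one of the known-good domains follows:
grep -n -o -P 'http://(?!good1\.com|ok\.ip\.add\.rss|good2\.com).{0,80}' data_file.txt
Note the dots in the domain names are escaped here; left unescaped (as in the original patterns) they would match any character.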

Related

How to store output for every xargs instance separately

cat domains.txt | xargs -P10 -I % ffuf -u %/FUZZ -w wordlist.txt -o output.json
ffuf is used for directory and file brute-forcing, while domains.txt contains valid HTTP and HTTPS URLs like http://example.com, http://example2.com. I used xargs to speed up the process by running 10 parallel instances. The problem is that I am unable to store the output of each instance separately: output.json gets overwritten by every running instance. Is there anything we can do to make output.json unique for every instance so that all the data gets saved separately? I tried ffuf/$(date '+%s').json instead, but it didn't work either.
Sure. Just name your output file using the domain. E.g.:
xargs -P10 -I % ffuf -u %/FUZZ -w wordlist.txt -o output-%.json < domains.txt
(I dropped cat because it was unnecessary.)
I missed the fact that your domains.txt file is actually a list of URLs rather than a list of domain names. I think the easiest fix is just to simplify domains.txt to be just domain names, but you could also try something like:
xargs -P10 -I % sh -c 'domain="%"; ffuf -u %/FUZZ -w wordlist.txt -o output-${domain##*/}.json' < domains.txt
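For reference, ${domain##*/} removes everything up to and including the last /, so for the bare URLs in the question the output file is named after the host. A quick illustration (example.com is just a placeholder):
domain="http://example.com"
echo "${domain##*/}"    # prints: example.com
For URLs that contain a path, this would keep only the last path component instead.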
cat domains.txt | xargs -P10 -I % sh -c "ping % > output.json.%"
Like this, your "%" can be part of the file name. (I changed your command to ping for my testing.)
So maybe something more like this:
cat domains.txt | xargs -P10 -I % sh -c "ffuf -u %/FUZZ -w wordlist.txt -o output.json.%"
I would replace your ffuf command with the following script and call that from the xargs command. It just strips the invalid file name characters out of the URL, replaces them with a dot, and then runs the command:
#!/usr/bin/bash
# Replace "://" in the URL with a dot so it can be used in a file name,
# then run ffuf with a per-URL output file.
URL=$1
FILE="$(echo "$URL" | sed 's/:\/\//./g')"
ffuf -u "${URL}/FUZZ" -w wordlist.txt -o "output-${FILE}.json"
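Assuming the script above is saved as run-ffuf.sh (a hypothetical name) and made executable, the xargs call would then pass each URL through as $1:
chmod +x run-ffuf.sh
xargs -P10 -I % ./run-ffuf.sh % < domains.txt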

Grep: using with -e and -o options together

I need to parse a big log file in one run and print the id, address and service_name of the requests found. The problem is that service_name is in the request body, which is quite big.
If I list all the patterns with the -e option -
grep -e 'ID: [0-9]\+' -e 'Address: .*' -e ':Body><[^ ]*'
the full request body will be printed.
What is needed is
grep -e 'ID: [0-9]\+' -e 'Address: .*' -o ':Body><[^ ]*'
or
grep -o 'ID: [0-9]\+' -o 'Address: .*' -o ':Body><[^ ]*'
to print only the first word of the request body, which is the name of the service;
but in this case the error grep: :Body><[^ ]*: No such file or directory is received
UPD: the solution with -oe and the regex works, but as it turned out, -o significantly slows down the operation.
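For reference, one way to combine -o with several patterns (possibly what the update's "-oe" refers to; a sketch assuming GNU grep, which accepts repeated -e alongside -o) is:
grep -o -e 'ID: [0-9]\+' -e 'Address: .*' -e ':Body><[^ ]*' my.log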
If you wish to print only the bits of a file that match these 3 regular expressions, and those three are never on the same line, you may use \|, which is grep's logical OR:
grep -o 'ID: [0-9]\+\|Address: .*\|:Body><[^ ]*' my.log
Without seeing your log it's difficult to fully understand what your example will return. You might try -Eo and see if that helps you get what you want. You may need to adjust the regex accordingly. Combined with -e for each pattern, -E should at least resolve the grep: :Body><[^ ]*: No such file or directory error you're receiving.
grep -Eo -e 'ID: [0-9]+' -e 'Address: .*' -e ':Body><[^ ]*' myLog.log

remove duplicate lines in wget output

I want to remove duplicate lines in wget output.
I use this code
wget -q "http://www.sawfirst.com/selena-gomez" -O -|tr ">" "\n"|grep 'selena-gomez-'|cut -d\" -f2|cut -d\# -f1|while read url;do wget -q "$url" -O -|tr ">" "\n"|grep 'name=.*content=.*jpg'|cut -d\' -f4|sort |uniq;done
And the output looks like this:
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-760.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-760.jpg
I want to remove duplicate lines of output.
Better try:
mech-dump --images "http://www.sawfirst.com/selena-gomez" |
grep -i '\.jpg$' |
sort -u
mech-dump is provided by the package libwww-mechanize-perl on Debian and derivatives.
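If it is not already installed, on Debian-based systems the package can be pulled in with apt (assuming apt is available):
sudo apt install libwww-mechanize-perl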
Output:
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-760.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-404.jpg
...
In some cases, tools like Beautiful Soup become more appropriate.
Trying to do this with only wget and grep becomes an interesting exercise; this is my naive try, but I am quite sure there are better ways of doing it:
$ wget -q "http://www.sawfirst.com/selena-gomez" -O - |
  grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
  grep -i "selena-gomez" |
  while read url; do
    if [[ $url == *jpg ]]
    then
      echo $url
    else
      wget -q $url -O - |
        grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
        grep -i "selena-gomez" |
        grep "\.jpg$" &
    fi
  done | sort -u > selena-gomez
In the first round:
wget -q "http://www.sawfirst.com/selena-gomez" -O -|
grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
grep -i "selena-gomez"
URLs matching the desired name are extracted. Within the while loop it can happen that $url already ends with .jpg, in which case it is simply printed instead of fetching its content again.
This approach only goes 1 level deep, and to try to speed things up it uses & at the end, with the intention of making multiple requests in parallel:
grep "\.jpg$" &
It would need to be checked whether the & jobs are actually waited for, i.e. whether all background jobs finish before the pipeline ends.
It ends with sort -u to return a unique list of items found.
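Regarding that open point, one way to be explicit about it (a sketch, not tested against the site; fetch_jpgs is a hypothetical helper wrapping the inner wget | grep pipeline shown above) is to group the loop and call wait before the group exits, so sort -u only sees end-of-input once every backgrounded fetch has finished:
wget -q "http://www.sawfirst.com/selena-gomez" -O - |
  grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
  grep -i "selena-gomez" |
  {
    while read -r url; do
      if [[ $url == *jpg ]]; then
        echo "$url"
      else
        fetch_jpgs "$url" &   # hypothetical helper: wget | grep -Eo ... | grep "\.jpg$"
      fi
    done
    wait   # block until every backgrounded fetch_jpgs has finished
  } | sort -u > selena-gomez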

How to store value from grep output in variable

I am working on a bash script in which I have to use a regular expression to match a string and then store the output in a variable so I can reuse it.
Here is my script:
#!/bin/sh
NAME="MET-3-get-code-from-string"
por="$($NAME | grep -P -o -e '(?<=MET-).*?(\d+)')" #this should store 3 in variable por
echo $por
I tried this in many ways, but I am getting this error:
./check.sh: MET-3-get-issue-id-from-branch-name: not found
If I run the grep command individually then yes, it works properly. But I am not able to store the output.
I also tried:
por=$($NAME | grep -P -o -e '(?<=MET-).*?(\d+)')
por=$NAME | grep -P -o -e '(?<=MET-).*?(\d+)'
and many other similar references.
but it's not working. Can anyone please help me with this? I don't have much experience in bash.
Thank you.
In your script, $NAME on its own is executed as a command (which is where the not found error comes from) instead of being passed to grep as input. Change
por="$($NAME | grep -P -o -e '(?<=MET-).*?(\d+)')"
to
por="$(echo "$NAME" | grep -P -o -e '(?<=MET-).*?(\d+)')"
Also, you are missing a closing double quote (maybe just a typo, should be NAME="MET-3-get-code-from-string")
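A small aside, not part of the fix above: since the script uses #!/bin/sh, printf is a slightly more portable way than echo to feed the variable to grep, and the regex can be tightened to capture just the digits after MET- (a sketch under those assumptions):
por="$(printf '%s\n' "$NAME" | grep -oP '(?<=MET-)\d+')"
echo "$por"    # prints: 3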

Bash - Command call ported to variable with another variable inside

I believe this is a simple syntax issue on my part, but I have been unable to find another example similar to what I'm trying to do. I have a variable taking in a specific disk location, and I need to use that location in an hdparm/grep command to pull out the max LBA.
targetDrive=$1 #/dev/sdb
maxLBA=$(hdparm -I /dev/sdb |grep LBA48 |grep -P -o '(?<=:\s)[^\s]*') #this works perfect
maxLBA=$(hdparm -I $1 |grep LBA48 |grep -P -o '(?<=:\s)[^\s]*') #this fails
I have also tried
maxLBA=$(hdparm -I 1 |grep LBA48 |grep -P -o '(?<=:\s)[^\s]*')
maxLBA=$(hdparm -I "$1" |grep LBA48 |grep -P -o '(?<=:\s)[^\s]*')
Thanks for the help
So I think here is the solution to your problem. I did basically the same as you, but changed the way I pipe the results into one another:
grep with a regular expression to find the line containing LBA48
cut to retrieve the second field when the resulting string is split on the colon ":"
tr to remove the spaces from the result
Here is my resulting bash script.
#!/bin/bash
target_drive=$1
max_lba=$(sudo hdparm -I "$target_drive" | grep -P -o ".+LBA48.+:.+(\d+)" | cut -d: -f2 | tr -d ' ')
echo "Drive: $target_drive MAX LBA48: $max_lba"
