remove duplicate lines in wget output - linux

I want to remove duplicate lines in wget output.
I use this code
wget -q "http://www.sawfirst.com/selena-gomez" -O -|tr ">" "\n"|grep 'selena-gomez-'|cut -d\" -f2|cut -d\# -f1|while read url;do wget -q "$url" -O -|tr ">" "\n"|grep 'name=.*content=.*jpg'|cut -d\' -f4|sort |uniq;done
And the output looks like this:
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-760.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-760.jpg
I want to remove the duplicate lines from the output.

Better try:
mech-dump --images "http://www.sawfirst.com/selena-gomez" |
grep -i '\.jpg$' |
sort -u
The mech-dump command comes from the package libwww-mechanize-perl on Debian and derivatives.
Output:
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-760.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-404.jpg
...
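For what it's worth, the pipeline in the question can also be fixed directly: the sort | uniq runs once per page inside the loop, so it only removes duplicates within a single page. Moving the deduplication outside the loop removes duplicates across all pages as well (a sketch using exactly the commands from the question):
wget -q "http://www.sawfirst.com/selena-gomez" -O - | tr ">" "\n" |
  grep 'selena-gomez-' | cut -d\" -f2 | cut -d\# -f1 |
  while read url; do
    wget -q "$url" -O - | tr ">" "\n" |
      grep 'name=.*content=.*jpg' | cut -d\' -f4
  done | sort -u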

In some cases, tools like Beautiful Soup are more appropriate.
Trying to do this with only wget and grep is an interesting exercise; this is my naive try, but I am sure there are better ways of doing it:
$ wget -q "http://www.sawfirst.com/selena-gomez" -O -|
grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
grep -i "selena-gomez" |
while read url; do
if [[ $url == *jpg ]]
then
echo $url
else
wget -q $url -O - |
grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
grep -i "selena-gomez" |
grep "\.jpg$" &
fi
done | sort -u > selena-gomez
In the first round:
wget -q "http://www.sawfirst.com/selena-gomez" -O -|
grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
grep -i "selena-gomez"
URLs matching the desired name are extracted. In the while loop, the $url may already end in .jpg, in which case it is simply printed instead of fetching the content again.
This approach only goes one level deep, and to try to speed things up it uses & at the end, with the intention of making multiple requests in parallel:
grep "\.jpg$" &
Need to check if the & lock or wait for all background jobs to finish
It ends with sort -u to return a unique list of items found.
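If in doubt about that, one option is to group the loop with an explicit wait, so the subshell that launched the background fetches does not exit until they have all finished. A sketch of the same loop (untested against the site):
wget -q "http://www.sawfirst.com/selena-gomez" -O - |
  grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
  grep -i "selena-gomez" |
  {
    while read url; do
      if [[ $url == *jpg ]]; then
        echo "$url"
      else
        wget -q "$url" -O - |
          grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
          grep -i "selena-gomez" |
          grep "\.jpg$" &
      fi
    done
    wait   # do not leave the group until every backgrounded fetch has exited
  } | sort -u > selena-gomez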

Related

How to store output for every xargs instance separately

cat domains.txt | xargs -P10 -I % ffuf -u %/FUZZ -w wordlist.txt -o output.json
ffuf is used for directory and file brute forcing, while domains.txt contains valid HTTP and HTTPS URLs like http://example.com and http://example2.com. I used xargs to speed up the process by running 10 parallel instances. The problem is that I am unable to store the output of each instance separately: output.json gets overwritten by every running instance. Is there anything we can do to make output.json unique for every instance so that all the data gets saved separately? I tried ffuf/$(date '+%s').json instead, but that didn't work either.
Sure. Just name your output file using the domain. E.g.:
xargs -P10 -I % ffuf -u %/FUZZ -w wordlist.txt -o output-%.json < domains.txt
(I dropped cat because it was unnecessary.)
I missed the fact that your domains.txt file is actually a list of URLs rather than a list of domain names. I think the easiest fix is to simplify domains.txt to just domain names, but you could also try something like:
xargs -P10 -I % sh -c 'domain="%"; ffuf -u %/FUZZ -w wordlist.txt -o output-${domain##*/}.json' < domains.txt
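A side note: substituting % straight into the sh -c string can misbehave if a URL ever contains quotes or shell metacharacters. A slightly more defensive variant passes the URL as a positional argument instead (same wordlist.txt and output naming as above, just a sketch):
xargs -P10 -I % sh -c 'url=$1; ffuf -u "$url/FUZZ" -w wordlist.txt -o "output-${url##*/}.json"' _ % < domains.txt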
cat domains.txt | xargs -P10 -I % sh -c "ping % > output.json.%"
Like this, your "%" can be part of the file name. (I changed your command to ping for my testing.)
So maybe something more like this:
cat domains.txt | xargs -P10 -I % sh -c "ffuf -u %/FUZZ -w wordlist.txt -o output.json.%
"
I would replace your ffuf command with the following script and call the script from xargs. It just replaces the characters that are invalid in a file name with a dot and then runs the command:
#!/usr/bin/bash
URL=$1
FILE=$(echo "$URL" | sed 's/:\/\//./g')
ffuf -u "${URL}/FUZZ" -w wordlist.txt -o "output-${FILE}.json"
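If the URLs can contain more than just ://, a broader sanitizer may be safer. Here is a sketch that maps every character outside a conservative set to a dot (the script name and wordlist path are assumptions, as above):
#!/usr/bin/bash
URL=$1
# replace anything that is not a letter, digit, dot, underscore or dash with a dot
FILE=$(printf '%s' "$URL" | tr -c 'A-Za-z0-9._-' '.')
ffuf -u "${URL}/FUZZ" -w wordlist.txt -o "output-${FILE}.json"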

How to echo the filename?

I'm searching the content of a .docx with this command:
unzip -p *.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' | grep $1
But I need the name of the file that contains the word I searched for. How can I do that?
You can walk through the files with a for loop:
for file in *.docx; do
  unzip -p "$file" word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' | grep PATTERN && echo "$file"
done
The && echo "$file" part prints the filename when grep finds the pattern.
Try with:
find . -name "*your_file_name*" | xargs grep your_word | cut -d':' -f1
If you're using GNU grep (likely, as you're on Linux), you might want to use this option:
--label=LABEL
    Display input actually coming from standard input as input coming from file LABEL. This is especially useful when implementing tools like zgrep, e.g., gzip -cd foo.gz | grep --label=foo -H something. See also the -H option.
So you'd have something like
for f in *.docx
do unzip -p "$f" word/document.xml \
   | sed -e "$sed_command" \
   | grep -H --label="$f" "$1"
done
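Here $sed_command is just shorthand for the sed expression from the question, so something like this is assumed beforehand:
sed_command='s/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'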

wget does not terminate

I have the following problem with my code:
After all the downloads have finished, the script does not terminate. It seems to wait for more URLs.
My code:
#!/bin/bash
cd "$1"
test=$(wget -qO- "$3" | grep --line-buffered "tarball_url" | cut -d '"' -f4)
echo test:
echo $test
echo ==============
wget -nd -N -q --trust-server-names --content-disposition -i- ${test}
An example for $test:
https://api.github.com/repos/matrixssl/matrixssl/tarball/3-9-1-open https://api.github.com/repos/matrixssl/matrixssl/tarball/3-9-0-open
-i means to get the list of URLs from a file, and using - in place of the file means to get them from standard input. So it's waiting for you to type the URLs.
If $test contains the URLs, you don't need to use -i, just list the URLs on the command line:
wget -nd -N -q --trust-server-names --content-disposition $test
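If the list of tarball URLs could ever get too long for a single command line, another option is to keep -i - but feed the URLs on standard input yourself, one per line. A sketch (it relies on $test being whitespace-separated, as shown above):
# $test is deliberately left unquoted so each whitespace-separated URL becomes its own line
printf '%s\n' $test | wget -nd -N -q --trust-server-names --content-disposition -i -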

How do you use wget to download the most up-to-date file on a site?

Hello, I am trying to use wget to download the most up-to-date McAfee patch, and I am having issues singling out the .tar file. This is what I have:
wget -q -O - ftp://ftp.mcafee.com/pub/antivirus/datfiles/4.x/ | grep -o -m 2 "avvdat-[^\']*"
However when I run the above command it gives me:
avvdat-8065.tar">avvdat-8065.tar</a> (95191040 bytes)
avvdat-8066.tar">avvdat-8066.tar</a> (95385600 bytes)
But I need it to be just the most recent .tar file between the <a> </a> tags, which in this case would be avvdat-8066.tar. Can someone please help me out with grepping the correct .tar? I am not too good with regex or sed.
Try this:
wget $(wget -q -O - ftp://ftp.mcafee.com/pub/antivirus/datfiles/4.x/ | grep -Eo "ftp://[^\"\]+" | sort | tail -n1)
I'd suggest modifying your grep regex so it retrieves only the file name, then using sort to sort the results and tail to discard all but the last one.
wget -q -O - ftp://ftp.mcafee.com/pub/antivirus/datfiles/4.x/ | grep -o -m 2 "avvdat-[^\'\"]*" | sort | tail -1
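One caveat with both commands: plain sort orders the names lexically, so avvdat-9999.tar would sort after avvdat-10000.tar and tail -1 would pick the wrong file once the numbering gains a digit. If GNU sort is available, its -V (version sort) option orders the numbers naturally (a sketch against the same listing):
wget -q -O - ftp://ftp.mcafee.com/pub/antivirus/datfiles/4.x/ | grep -o "avvdat-[0-9]*\.tar" | sort -Vu | tail -1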

Bash - Command call ported to variable with another variable inside

I believe this is a simple syntax issue on my part, but I have been unable to find another example similar to what I'm trying to do. I have a variable taking in a specific disk location, and I need to use that location in an hdparm/grep command to pull out the max LBA:
targetDrive=$1 #/dev/sdb
maxLBA=$(hdparm -I /dev/sdb |grep LBA48 |grep -P -o '(?<=:\s)[^\s]*') # this works perfectly
maxLBA=$(hdparm -I $1 |grep LBA48 |grep -P -o '(?<=:\s)[^\s]*') # this fails
I have also tried
maxLBA=$(hdparm -I 1 |grep LBA48 |grep -P -o '(?<=:\s)[^\s]*')
maxLBA=$(hdparm -I "$1" |grep LBA48 |grep -P -o '(?<=:\s)[^\s]*')
Thanks for the help
So I think this is the solution to your problem. I did basically the same as you, but changed the way I pipe the results into one another:
grep with a regular expression to find the line containing LBA48
cut to retrieve the second field when the resulting string is split on the colon ":"
then trim all the leading spaces from the result
Here is my resulting bash script.
#!/bin/bash
target_drive=$1
max_lba=$(sudo hdparm -I "$target_drive" | grep -P -o ".+LBA48.+:.+(\d+)" | cut -d: -f2 | tr -d ' ')
echo "Drive: $target_drive MAX LBA48: $max_lba"
