How to use wget --spider to identify broken URLs from a list of URLs and save the broken ones (Linux)

I am trying to write a shell script to identify broken URLs from a list of URLs.
Here is a sample input_url.csv:
https://www.google.com/
https://www.nbc.com
https://www.google.com.hksjkhkh/
https://www.google.co.jp/
https://www.google.ca/
Here is what I have which works:
wget --spider -nd -nv -H --max-redirect 0 -o run.log -i input_url.csv
For valid URLs this gives me lines like '2019-09-03 19:48:37 URL: https://www.nbc.com 200 OK', and for broken ones it gives '0 redirections exceeded.'
What I want is to save only the broken links into my output file.
Sample expected output:
https://www.google.com.hksjkhkh/

I think I would go with:
<input_url.csv xargs -n1 -P10 sh -c 'wget --spider --quiet "$1" || echo "$1"' --
You can use the -P <count> option of xargs to run <count> processes in parallel.
xargs runs the command sh -c '....' -- once for each line of the input file, appending that line as an argument to the script.
Inside, sh runs wget --spider --quiet "$1". The || triggers when the return status is nonzero, which means failure, so on wget failure echo "$1" prints the URL.
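The pattern can be sanity-checked without network access by substituting a stub test for wget; here a case statement that fails on entries containing "bad" stands in for the spider check (the names are illustrative):

```shell
# stub URL list: two "broken" entries among good ones;
# the case statement fails for them, standing in for wget --spider
printf '%s\n' ok1 bad2 ok3 bad4 |
xargs -n1 -P4 sh -c 'case "$1" in *bad*) false;; *) true;; esac || echo "$1"' -- |
sort
# prints:
# bad2
# bad4
```

In the real command, wget --spider --quiet "$1" replaces the case statement, and the result is redirected to the output file.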
You could also filter the output of wget -nd -nv and then regex it, like
wget --spider -nd -nv -H --max-redirect 0 -i input 2>&1 | grep -v '200 OK' | grep 'unable' | sed 's/.* .//; s/.$//'
but this does not look extensible, is not parallel (so probably slower), and is probably not worth the hassle.

Related

How to store output for every xargs instance separately

cat domains.txt | xargs -P10 -I % ffuf -u %/FUZZ -w wordlist.txt -o output.json
ffuf is used for directory and file brute-forcing, while domains.txt contains valid HTTP and HTTPS URLs like http://example.com, http://example2.com. I used xargs to speed up the process by running 10 parallel instances. The problem is that I am unable to store the output of each instance separately: output.json gets overwritten by every running instance. Is there anything we can do to make output.json unique for every instance so that all data gets saved separately? I tried ffuf/$(date '+%s').json instead, but it didn't work either.
Sure. Just name your output file using the domain. E.g.:
xargs -P10 -I % ffuf -u %/FUZZ -w wordlist.txt -o output-%.json < domains.txt
(I dropped cat because it was unnecessary.)
I missed the fact that your domains.txt file is actually a list of URLs rather than a list of domain names. I think the easiest fix is to simplify domains.txt to just domain names, but you could also try something like:
xargs -P10 -I % sh -c 'domain="%"; ffuf -u %/FUZZ -w wordlist.txt -o output-${domain##*/}.json' < domains.txt
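The ${domain##*/} expansion used above strips everything up to the last slash; a quick check (the URL is illustrative):

```shell
domain="http://example.com"
echo "${domain##*/}"   # prints: example.com
```

Note this assumes URLs without a trailing path: for http://example.com/path the expansion would yield only path.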
cat domains.txt | xargs -P10 -I % sh -c "ping % > output.json.%"
Like this, your "%" can be part of the file name. (I changed your command to ping for my testing.)
So maybe something more like this:
cat domains.txt | xargs -P10 -I % sh -c "ffuf -u %/FUZZ -w wordlist.txt -o output.json.%"
I would replace your ffuf command with the following script and call that from xargs. It strips out the characters that are invalid in file names, replaces them with a dot, and then runs the command:
#!/usr/bin/bash
URL=$1
FILE="$(echo "$URL" | sed 's/:\/\//./g')"
ffuf -u "${URL}/FUZZ" -w wordlist.txt -o "output-${FILE}.json"
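The sed substitution only rewrites the :// separator; a quick check of what file name results (the URL is illustrative):

```shell
URL="http://example.com"
FILE="$(echo "$URL" | sed 's/:\/\//./g')"
echo "$FILE"   # prints: http.example.com
```

So the output file would be output-http.example.com.json; note that any slashes in a URL path would still end up in the file name.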

remove duplicate lines in wget output

I want to remove duplicate lines in wget output.
I use this code
wget -q "http://www.sawfirst.com/selena-gomez" -O - |
tr ">" "\n" |
grep 'selena-gomez-' |
cut -d\" -f2 |
cut -d\# -f1 |
while read url; do
  wget -q "$url" -O - | tr ">" "\n" | grep 'name=.*content=.*jpg' | cut -d\' -f4 | sort | uniq
done
And output like this
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-760.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-760.jpg
I want to remove duplicate lines of output.
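A minimal fix, assuming global de-duplication is the goal: the sort | uniq in the code above runs inside the loop, so it only de-duplicates within one page. Moving it outside the loop (done | sort -u) lets it see the combined output. The effect can be checked with stub data standing in for the scraped URLs:

```shell
# stub list simulating the loop's combined output across pages
printf '%s\n' Selena-Gomez-12.jpg Selena-Gomez-12.jpg Selena-Gomez-760.jpg |
sort -u
# prints:
# Selena-Gomez-12.jpg
# Selena-Gomez-760.jpg
```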
Better, try:
mech-dump --images "http://www.sawfirst.com/selena-gomez" |
grep -i '\.jpg$' |
sort -u
mech-dump comes with the package libwww-mechanize-perl on Debian and derivatives.
Output:
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-760.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-404.jpg
...
In some cases, tools like Beautiful Soup are more appropriate.
Trying to do this with only wget and grep becomes an interesting exercise. This is my naive try, but I am sure there are better ways of doing it:
wget -q "http://www.sawfirst.com/selena-gomez" -O - |
grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
grep -i "selena-gomez" |
while read -r url; do
  if [[ $url == *jpg ]]; then
    echo "$url"
  else
    wget -q "$url" -O - |
      grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
      grep -i "selena-gomez" |
      grep "\.jpg$" &
  fi
done | sort -u > selena-gomez
In the first round:
wget -q "http://www.sawfirst.com/selena-gomez" -O -|
grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
grep -i "selena-gomez"
URLs matching the desired name are extracted. In the while loop, the $url may already end in .jpg, in which case it is only printed instead of fetching the content again.
This approach only goes one level deep, and to try to speed things up it uses & at the end, with the intention of making multiple requests in parallel:
grep "\.jpg$" &
One still needs to check whether the & blocks, or whether the pipeline waits for all background jobs to finish.
It ends with sort -u to return a unique list of items found.
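On that open question: because the backgrounded greps inherit the write end of the pipe, sort only sees EOF once they all exit, so the pipeline does wait; adding an explicit wait before the loop ends makes this deterministic. A stub sketch (sleep and echo stand in for the background fetches):

```shell
# each background job simulates one parallel fetch;
# wait blocks until every background job has finished,
# then sort -u de-duplicates the combined output
{
  for u in b.jpg a.jpg b.jpg; do
    ( sleep 0.1; echo "$u" ) &
  done
  wait
} | sort -u
# prints:
# a.jpg
# b.jpg
```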

wget does not terminate

I have the following problem with my code:
After the downloads are all finished the script does not terminate. It seems to wait for more urls.
My code:
#!/bin/bash
cd "$1"
test=$(wget -qO- "$3" | grep --line-buffered "tarball_url" | cut -d '"' -f4)
echo test:
echo $test
echo ==============
wget -nd -N -q --trust-server-names --content-disposition -i- ${test}
An example for $test:
https://api.github.com/repos/matrixssl/matrixssl/tarball/3-9-1-open https://api.github.com/repos/matrixssl/matrixssl/tarball/3-9-0-open
-i means to get the list of URLs from a file, and using - in place of the file name means to read them from standard input. So it is waiting for you to type the URLs.
If $test contains the URLs, you don't need to use -i, just list the URLs on the command line:
wget -nd -N -q --trust-server-names --content-disposition $test
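For reference, the unquoted $test works here because the shell word-splits it into separate arguments, one per URL. A stub check (the URLs are illustrative):

```shell
test="https://example.com/a https://example.com/b"
set -- $test   # unquoted on purpose: word splitting yields one argument per URL
echo "$#"      # prints: 2
```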

Need response time and download time for the URLs and write shell scripts for same

I have used this command to get the response time:
curl -s -w "%{time_total}\n" -o /dev/null https://www.ziffi.com/suggestadoc/js/ds.ziffi.https.v308.js
I also need the download time of the js file linked below, so I used wget to download it, but I get multi-field output and I just need the download time from it:
wget --output-document=/dev/null https://www.ziffi.com/suggestadoc/js/ds.ziffi.https.v307.js
Please suggest.
I think what you are looking for is this:
wget --output-document=/dev/null https://www.ziffi.com/suggestadoc/js/ds.ziffi.https.v307.js 2>&1 >/dev/null | grep = | awk '{print $5}' | sed 's/^.*\=//'
Explanation:
2>&1 >/dev/null | --> makes sure stderr gets piped instead of stdout (the order matters: stderr is duplicated into the pipe first, then stdout is discarded)
grep = --> selects the line that contains the '=' symbol (wget's progress-bar line)
awk '{print $5}' --> picks the fifth whitespace-separated field of that line
sed 's/^.*\=//' --> deletes everything from line start to the '=' symbol
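An alternative that avoids parsing wget's log entirely: curl's -w format supports several timing variables (time_connect, time_starttransfer, and time_total are documented ones), so the response and download timings can come from one call:

```shell
# one request reporting connect, time-to-first-byte, and total time
curl -s -o /dev/null \
  -w 'connect: %{time_connect}\nstart transfer: %{time_starttransfer}\ntotal: %{time_total}\n' \
  https://www.ziffi.com/suggestadoc/js/ds.ziffi.https.v307.js
```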

wget and bash error: bash: line 0: fg: no job control

I am trying to run a series of commands in parallel through xargs. I created a null-separated list of commands in a file cmd_list.txt and then attempted to run them in parallel with 6 threads as follows:
cat cmd_list.txt | xargs -0 -P 6 -I % bash -c %
However, I get the following error:
bash: line 0: fg: no job control
I've narrowed down the problem to be related to the length of the individual commands in the command list. Here's an example artificially-long command to download an image:
mkdir a-very-long-folder-de090952623b4865c2c34bd6330f8a423ed05ed8de090952623b4865c2c34bd6330f8a423ed05ed8de090952623b4865c2c34bd6330f8a423ed05ed8
wget --no-check-certificate --no-verbose -O a-very-long-folder-de090952623b4865c2c34bd6330f8a423ed05ed8de090952623b4865c2c34bd6330f8a423ed05ed8de090952623b4865c2c34bd6330f8a423ed05ed8/blah.jpg http://d4u3lqifjlxra.cloudfront.net/uploads/example/file/48/accordion.jpg
Just running the wget command on its own, without the file list and without xargs, works fine. However, running this command at the bash command prompt (again, without the file list) fails with the no job control error:
echo "wget --no-check-certificate --no-verbose -O a-very-long-folder-de090952623b4865c2c34bd6330f8a423ed05ed8de090952623b4865c2c34bd6330f8a423ed05ed8de090952623b4865c2c34bd6330f8a423ed05ed8/blah.jpg http://d4u3lqifjlxra.cloudfront.net/uploads/example/file/48/accordion.jpg" | xargs -I % bash -c %
If I leave out the long folder name and therefore shorten the command, it works fine:
echo "wget --no-check-certificate --no-verbose -O /tmp/blah.jpg http://d4u3lqifjlxra.cloudfront.net/uploads/example/file/48/accordion.jpg" | xargs -I % bash -c %
xargs has a -s (size) parameter that can change the max size of the command line length, but I tried increasing it to preposterous sizes (e.g., 16000) without any effect. I thought that the problem may have been related to the length of the string passed in to bash -c, but the following command also works without trouble:
bash -c "wget --no-check-certificate --no-verbose -O a-very-long-folder-de090952623b4865c2c34bd6330f8a423ed05ed8de090952623b4865c2c34bd6330f8a423ed05ed8de090952623b4865c2c34bd6330f8a423ed05ed8/blah.jpg http://d4u3lqifjlxra.cloudfront.net/uploads/example/file/48/accordion.jpg"
I understand that there are other options to run commands in parallel, such as the parallel command (https://stackoverflow.com/a/6497852/1410871), but I'm still very interested in fixing my setup or at least figuring out where it's going wrong.
I'm on Mac OS X 10.10.1 (Yosemite).
It looks like the solution is to avoid the -I parameter for xargs which, per the OS X xargs man page, has a 255-byte limit on the replacement string. Instead, the -J parameter is available, which does not have a 255-byte limit.
So my command would look like:
echo "wget --no-check-certificate --no-verbose -O a-very-long-folder-de090952623b4865c2c34bd6330f8a423ed05ed8de090952623b4865c2c34bd6330f8a423ed05ed8de090952623b4865c2c34bd6330f8a423ed05ed8/blah.jpg http://d4u3lqifjlxra.cloudfront.net/uploads/example/file/48/accordion.jpg" | xargs -J % bash -c %
However, in the above command, only the portion of the replacement string before the first whitespace is passed to bash, so bash tries to execute:
wget
which obviously results in an error. My solution is to ensure that xargs interprets the commands as null-delimited instead of whitespace-delimited using the -0 parameter, like so:
echo "wget --no-check-certificate --no-verbose -O a-very-long-folder-de090952623b4865c2c34bd6330f8a423ed05ed8de090952623b4865c2c34bd6330f8a423ed05ed8de090952623b4865c2c34bd6330f8a423ed05ed8/blah.jpg http://d4u3lqifjlxra.cloudfront.net/uploads/example/file/48/accordion.jpg" | xargs -0 -J % bash -c %
and finally, this works!
Thank you to @CharlesDuffy, who provided most of this insight. And no thank you to my OS X version of xargs for its poor handling of replacement strings that exceed the 255-byte limit.
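For what it's worth, the error message itself is easy to reproduce: when the replacement string exceeds the limit, xargs passes the % through unreplaced, and bash interprets a bare % as a job-control spec (shorthand for fg %+):

```shell
bash -c '%'
# bash: line 1: fg: no job control   (exact wording varies by bash version)
```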
I suspect it's the percent symbol, and your top shell complaining.
cat cmd_list.txt | xargs -0 -P 6 -I % bash -c %
Percent is a metacharacter for job control, e.g. "fg %2" or "kill %4".
Try escaping the percents with a backslash to signal to the top shell that it should not try to interpret the percent, and xargs should be handed a literal percent character.
cat cmd_list.txt | xargs -0 -P 6 -I \% bash -c \%
