Search for string within html link on webpage and download the linked file - linux

I am trying to write a Linux script to search for a link on a web page and download the file from that link.
The web page is:
http://ocram.github.io/picons/downloads.html
The link I am interested in is:
"hd.reflection-black.7z"
The way I was originally doing this was with these commands:
lynx -dump -listonly http://ocram.github.io/picons/downloads.html &> output1.txt
cat output1.txt | grep "17" &> output2.txt
cut -b 1-6 --complement output2.txt &> output3.txt
wget -i output3.txt
I am hoping there is an easier way to search the web page for the link "hd.reflection-black.7z" and save the linked file.
The files are stored on Google Drive, and the URL does not contain the filename, hence the use of "17" in the second line of code above.
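For reference, this is roughly what the original pipeline does, with comments added (the "17" appears to be the link's number in lynx's listing, so this only works while the link stays at position 17):
# dump a numbered list of every link on the page (one " NN. URL" per line)
lynx -dump -listonly http://ocram.github.io/picons/downloads.html &> output1.txt
# keep the line(s) containing "17" - relying on the wanted link being number 17 in the listing
cat output1.txt | grep "17" &> output2.txt
# strip the first six characters (the " 17. " numbering) so only the URL remains
cut -b 1-6 --complement output2.txt &> output3.txt
# download every URL listed in the resulting file
wget -i output3.txt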

#linuxnoob, if you want to download the file (curl is more powerful than wget):
curl -L --compressed `(curl --compressed "http://ocram.github.io/picons/downloads.html" 2> /dev/null | \
grep -o '<a .*href=.*>' | \
sed -e 's/<a /\n<a /g' | \
grep hd.reflection-black.7z | \
sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d')` > hd.reflection-black.7z
without indentation, for your script:
curl -L --compressed `(curl --compressed "http://ocram.github.io/picons/downloads.html" 2> /dev/null | grep -o '<a .*href=.*>' | sed -e 's/<a /\n<a /g' | grep hd.reflection-black.7z | sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d')` > hd.reflection-black.7z 2>/dev/null
You can try it!

What about this?
curl --compressed "http://ocram.github.io/picons/downloads.html" | \
grep -o '<a .*href=.*>' | \
sed -e 's/<a /\n<a /g' | \
grep hd.reflection-black.7z | \
sed -e 's/<a .*href=['"'"'"]//' -e 's/["'"'"'].*$//' -e '/^$/ d'

I'd try to avoid using regular expressions, since they tend to break in unexpected ways (e.g. when the output is split across more than one line for some reason).
I suggest using a scripting language like Ruby or Python, where higher-level tools are available.
The following example is in Ruby:
#!/usr/bin/ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
main_url = ARGV[0] # 'http://ocram.github.io/picons/downloads.html'
filename = ARGV[1] # 'hd.reflection-black.7z'
doc = Nokogiri::HTML(open(main_url))
url = doc.xpath("//a[text()='#{filename}']").first['href']
File.open(filename, 'w+') do |file|
  open(url, 'r') do |link|
    IO.copy_stream(link, file)
  end
end
Save it to a file like fetcher.rb and then you can use it with
ruby fetcher.rb http://ocram.github.io/picons/downloads.html hd.reflection-black.7z
To make it work you'll have to install Ruby and the Nokogiri library (both are available in most distros' repositories).
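As a rough sketch, on a Debian-like system that could look something like this (package names vary between distributions):
# install the Ruby interpreter and the Nokogiri HTML/XML parser from the distro repositories
sudo apt-get install ruby ruby-nokogiri
# or install Nokogiri through RubyGems instead
sudo gem install nokogiri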

Related

linux shell script stops after first line

I am trying to execute etherwake based on an MQTT topic.
The output of mosquitto_sub stops if I pipe it into a while loop.
works:
# mosquitto_sub -L mqtt://... | grep -o -E '([[:xdigit:]]{2}:){5}[[:xdigit:]]{2}'
00:00:00:00:de:ad
00:00:00:00:be:ef
00:00:00:00:ca:fe
(goes on and on)
does not work:
mosquitto_sub -L mqtt://... \
| grep -o -E '([[:xdigit:]]{2}:){5}[[:xdigit:]]{2}' \
| hexdump
Output stops after a single line:
0000000 1234 5678 9abc def0 abcd cafe 3762 3a65
The big picture is this:
mosquitto_sub -L mqtt://... \
| grep -o -E '([[:xdigit:]]{2}:){5}[[:xdigit:]]{2}' \
| while read macaddr; do
echo "send WOL to " $macaddr;
/usr/bin/etherwake -D -b "$macaddr" 2>&1;
done
Usually I am fine with the Linux shell, but this time it simply gets stuck after the first line.
My guess is that there is some problem with stdin or stdout (not being read, a full buffer, etc.), but I am out of ideas.
By the way, it is an OpenWrt shell, so ash rather than bash.
The problem is indeed the buffering of grep when used with pipes.
Usually the --line-buffered switch should be used to force grep to process the data line by line instead of buffering it.
Because grep on OpenWrt (BusyBox) does not have this switch, awk is used instead:
mosquitto_sub -L mqtt://... \
| awk '/([[:xdigit:]]{2}:){5}[[:xdigit:]]{2}/{ print $0 }' \
| hexdump
If a non-BusyBox grep is being used, the solution would look like this:
mosquitto_sub -L mqtt://... \
| grep -o --line-buffered -E '([[:xdigit:]]{2}:){5}[[:xdigit:]]{2}' \
| hexdump
Thank you all a lot for your help.

remove duplicate lines in wget output

I want to remove duplicate lines in wget output.
I use this code
wget -q "http://www.sawfirst.com/selena-gomez" -O -|tr ">" "\n"|grep 'selena-gomez-'|cut -d\" -f2|cut -d\# -f1|while read url;do wget -q "$url" -O -|tr ">" "\n"|grep 'name=.*content=.*jpg'|cut -d\' -f4|sort |uniq;done
And the output looks like this:
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-760.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-760.jpg
I want to remove duplicate lines of output.
Better, try:
mech-dump --images "http://www.sawfirst.com/selena-gomez" |
grep -i '\.jpg$' |
sort -u
mech-dump is provided by the libwww-mechanize-perl package on Debian and derivatives.
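If mech-dump is not installed, on Debian and derivatives it can typically be pulled in with:
sudo apt-get install libwww-mechanize-perl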
Output:
http://www.sawfirst.com/wp-content/uploads/2018/03/Selena-Gomez-12.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-760.jpg
http://www.sawfirst.com/wp-content/uploads/2018/02/Selena-Gomez-404.jpg
...
In some cases, tools like Beautiful Soup are more appropriate.
Trying to do this with only wget & grep is an interesting exercise; this is my naive try, but I am sure there are better ways of doing it:
$ wget -q "http://www.sawfirst.com/selena-gomez" -O -|
grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
grep -i "selena-gomez" |
while read url; do
if [[ $url == *jpg ]]
then
echo $url
else
wget -q $url -O - |
grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
grep -i "selena-gomez" |
grep "\.jpg$" &
fi
done | sort -u > selena-gomez
In the first round:
wget -q "http://www.sawfirst.com/selena-gomez" -O -|
grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
grep -i "selena-gomez"
URLs matching the desired name are extracted. In the while loop it may be the case that $url already ends with .jpg, in which case it is simply printed instead of fetching the content again.
This approach only goes one level deep, and to try to speed things up it uses & at the end, with the intention of making multiple requests in parallel:
grep "\.jpg$" &
It still needs to be checked whether the & could leave jobs running, or whether one should wait for all background jobs to finish; see the sketch below.
It ends with sort -u to return a unique list of the items found.
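Here is a minimal sketch of that check: the loop is wrapped in a command group so that wait runs in the same subshell as the loop and blocks until every backgrounded fetch has finished:
wget -q "http://www.sawfirst.com/selena-gomez" -O - |
  grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
  grep -i "selena-gomez" |
  {
    while read -r url; do
      if [[ $url == *jpg ]]; then
        echo "$url"
      else
        wget -q "$url" -O - |
          grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" |
          grep -i "selena-gomez" |
          grep "\.jpg$" &
      fi
    done
    wait   # block until all background fetches have finished
  } | sort -u > selena-gomez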

xpath html combine columns

I'm trying to extract data from socks-proxy.net with the IP and port from the website table.
I'm using these commands in Linux to get the IP and the port. How can I combine them?
wget -q -O - "https://socks-proxy.net" | xmllint --html --xpath "//table[@id=\"proxylisttable\"]//tr//td[1]//text()" - 2>/dev/null
Output:
103.254.12.3393.12.55.94192:12:44:11
It runs the IPs together, which is not good.
That will get all the IPs from the website's table.
wget -q -O - "https://socks-proxy.net" | xmllint --html --xpath "//table[@id=\"proxylisttable\"]//tr//td[2]//text()" - 2>/dev/null
That will get all the ports.
Output:
108025951082
It runs the ports together, and that's not good.
Question: how can I combine them with the desired example output:
103.254.12.33:1080
93.12.55.94:2595
192:12:44:11:1082
and so on...
A bit late, but seeing you're using 4(!) different tools to accomplish something so simple I just had to jump in to show you another amazing XML parser, called xidel, which can do it all by itself:
$ xidel -s "https://socks-proxy.net" -e '
//table[#class="table table-striped table-bordered"]/tbody/tr/x"my{td[5]}://{td[1]}:{td[2]}"
'
mySocks4://103.254.126.130:1080
mySocks5://192.228.194.87:25950
mySocks5://173.162.95.122:62168
mySocks4://183.166.22.194:1080
mySocks5://70.44.216.252:40656
[...]
Complex solution:
wget -q -O - "https://socks-proxy.net" \
| xmllint --html --xpath "//table[#id='proxylisttable']//tr//td[position() < 3]" - 2>/dev/null
| tidy -cq -omit -f /dev/null | xmllint --html --xpath "//td/text()" - | paste - - -d':'
The output:
103.254.126.130:1080
192.228.194.87:25950
173.162.95.122:62168
183.166.22.194:1080
70.44.216.252:40656
66.83.161.74:34036
37.191.146.151:10200
101.100.171.69:52769
120.92.164.154:62080
216.37.80.226:61226
75.180.14.170:17694
74.221.106.14:10200
208.180.142.167:14846
...
Extended approach to cover additional fields:
wget -q -O - "https://socks-proxy.net" \
| xmllint --html --xpath "//table[#id='proxylisttable']//tr//td[position() < 3]" - 2>/dev/null
| tidy -cq -omit -f /dev/null | xmllint --html --xpath "//td/text()" - \
| awk -F'\n' -v RS= '{ for(i=1;i<=NF;i+=5) printf "my%s://%s:%s\n",$(i+4),$i,$(i+1) }'
The output:
mySocks4://103.254.126.130:1080
mySocks5://192.228.194.87:25950
mySocks5://173.162.95.122:62168
mySocks4://183.166.22.194:1080
mySocks5://70.44.216.252:40656
mySocks5://66.83.161.74:34036
mySocks5://37.191.146.151:10200
mySocks5://101.100.171.69:52769
mySocks5://120.92.164.154:62080
....
P.S. Tested on your input file given by https://pastebin.com/F14VRNBc.

Add some text to each line of txt file and pass to wget

I have a file called filename.txt that contains file names with extensions.
I want to add a URL like www.abc.com/ before each line
and pass it to wget like this:
cat filename.txt | xargs -n 1 -P 16 wget -q -P /location
Thanks
Sounds like you want to prefix each line in filename.txt with a string:
sed -e 's#^#www.abc.com/#' filename.txt
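Combining that with the xargs invocation from the question, a sketch of the whole thing could be:
# prefix every line with the URL, then download each result with up to 16 parallel wget processes
sed -e 's#^#www.abc.com/#' filename.txt | xargs -n 1 -P 16 wget -q -P /location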
I got my answer; thanks to all for your valuable responses:
awk '{print "https://<URL>" $0;}' filename.txt | xargs -n 1 -P 16 wget -q -P /location

how to echo the filename?

I'm searching the content of a .docx file with this command:
unzip -p *.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' | grep $1
But I need the name of the file which contains the word I searched for. How can I do that?
You can walk through the files with a for loop:
for file in *.docx; do
  unzip -p "$file" word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' | grep PATTERN && echo $file
done
The && echo $file part prints the filename when grep finds the pattern.
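If you only want the filenames (not the matching lines as well), a sketch of the same loop using grep -q would be (PATTERN again stands for the word being searched):
for file in *.docx; do
  # -q makes grep silent; only its exit status is used to trigger the echo
  unzip -p "$file" word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g' | grep -q PATTERN && echo "$file"
done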
Try with:
find . -name "*your_file_name*" | xargs grep your_word | cut -d':' -f1
If you're using GNU grep (likely, as you're on Linux), you might want to use this option:
--label=LABEL
Display input actually coming from standard input as input coming from file LABEL. This is especially useful when implementing tools like zgrep, e.g., gzip -cd foo.gz | grep --label=foo -H something. See also the -H option.
So you'd have something like
for f in *.docx
do unzip -p "$f" word/document.xml \
| sed -e "$sed_command" \
| grep -H --label="$f" "$1"
done
