Exclude directories from wget to create sitemap - linux

I'm trying to use a shell script to scrape a website and get a list of all its pages. I found the shell script "Written by Elmar Hanlhofer https://www.plop.at" and it works well. However, I need to exclude some directories, and the documented approach isn't working for me.
# Example, exclude files from /print and /slide:
# files=$(find | grep -i html | grep -v "$SITE/print" | grep -v "$SITE/slide")
I need to exclude a forum installed at /support (and all of its child directories), so I modified the code to:
files=$(find | grep -i html | grep -v "$SITE/support")
However, it is still scanning /support/directory/directory/ and so on. How do I modify the grep command to exclude /support and all of its child directories?
I am very new to Linux/Unix commands, so I may not be expressing this correctly. Thank you.

The original script downloads the whole site, then runs find to filter out content you don't want.
The wget section of the script is copied below:
wget \
--recursive \
--no-clobber \
--page-requisites \
--convert-links \
--restrict-file-names=windows \
--no-parent \
--directory-prefix="$TMP" \
--domains $DOMAIN \
--user-agent="$AGENT" \
$URL >& $WGET_LOG
To exclude the support directory, add the --exclude-directories option:
wget \
--recursive \
--no-clobber \
--page-requisites \
--convert-links \
--restrict-file-names=windows \
--no-parent \
--directory-prefix="$TMP" \
--domains $DOMAIN \
--user-agent="$AGENT" \
--exclude-directories=/support \
$URL >& $WGET_LOG
See the wget documentation on directory-based limits if you want more control over which directories are included or excluded.
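Note that --exclude-directories takes a comma-separated list, and list elements may contain wildcards, so several sections can be skipped in one run. A quick sketch, reusing the /print and /slide paths from the script's comment (the remaining options stay as in the script above):
wget \
--recursive \
--no-parent \
--exclude-directories=/support,/print,/slide \
$URL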

Related

How to log in to a website using wget (POST method) with redirect

What I'm looking for
I want to scrape a website and get an alert as soon as there are changes.
But I am having a little trouble understanding the authentication mechanism set up by this website.
I have tried almost every trick to connect via wget, but none worked.
The website: offre.astria.com
What I tried
username=XXXXX
password=XXXXX
wget --save-cookies cookies.txt \
--keep-session-cookies \
--post-data "username=$username&password=$password" \
--delete-after \
https://cas.astria.com/cas-ext/login?service=https://offre.astria.com
wget --load-cookies cookies.txt \
https://offre.astria.com
After your comments, I also tried this:
username=XXXX
password=XXXX
code=`wget -qO- https://cas.astria.com/cas-ext/login?service=https://offre.astria.com | grep 'name="lt"' | cut -d"_" -f2`
hidden_code=_$code
wget --save-cookies cookies.txt \
--keep-session-cookies \
--post-data "username=$username&password=$password&lt=$hidden_code&_eventId=submit" \
--delete-after \
https://cas.astria.com/cas-ext/login?service=https://offre.astria.com
wget --load-cookies cookies.txt \
https://offre.astria.com
And the error goes from HTTP 302 to HTTP 500 (servlet internal error).
Maybe because the hidden field changes its value between the first request and the second...
I tried this too:
username=XXXX
password=XXXX
code=`wget -qO- --save-cookies cookies.txt --keep-session-cookies --delete-after https://cas.astria.com/cas-ext/login?service=https://offre.astria.com | grep 'name="lt"' | cut -d"_" -f2,3 | cut -d"\"" -f1`
wget --load-cookies cookies.txt \
--post-data "username=$username&password=$password&lt=_$code&_eventId=submit" \
--delete-after \
https://cas.astria.com/cas-ext/login?service=https://offre.astria.com
wget --load-cookies cookies.txt \
https://offre.astria.com
And the same results :s
If you look at the source of the web page, you will find the login form has a hidden field:
<input type="hidden" name="lt" value="_c72D26258-FFFE-C561-D3EC-9BAC7F85B9A4_kCB5ECD54-57E0-1FD5-8A2F-BF72E114C604" />
The value of this field is generated server side, and changes every time you refresh the page.
This is a mechanism to prevent exactly what you are trying to do.
You may be able to bypass it by having a script read the page, extract the value, and then use wget to submit a properly filled-in form.
It is, however, possible that you will encounter other defensive measures on the site.
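A minimal sketch of that approach, assuming $username and $password are set as above, that the lt value sits on a single line of the login page (as in the snippet shown), and that the CAS session is carried entirely in the cookies - none of which I have verified against this site:
LOGIN_URL="https://cas.astria.com/cas-ext/login?service=https://offre.astria.com"
# Fetch the login page once, keeping the session cookies it sets,
# and extract the current value of the hidden "lt" field.
lt=$(wget -qO- --save-cookies cookies.txt --keep-session-cookies "$LOGIN_URL" \
  | sed -n 's/.*name="lt" value="\([^"]*\)".*/\1/p')
# Submit the form with the same cookies, so the lt value still matches
# the session it was generated for.
wget --load-cookies cookies.txt --save-cookies cookies.txt \
  --keep-session-cookies \
  --post-data "username=$username&password=$password&lt=$lt&_eventId=submit" \
  --delete-after \
  "$LOGIN_URL"
wget --load-cookies cookies.txt https://offre.astria.com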

How to convert curl into python

I am reading the Instagram API documentation. There is a curl command, and I have been trying to work out what curl -F means, but somehow I am getting nowhere...
I would appreciate it if anyone could provide some insight on this topic.
curl -F 'client_id=CLIENT_ID' \
-F 'client_secret=CLIENT_SECRET' \
-F 'grant_type=authorization_code' \
-F 'redirect_uri=AUTHORIZATION_REDIRECT_URI' \
-F 'code=CODE' \
https://api.instagram.com/oauth/access_token
-F is short for --form; it makes curl submit the field as multipart/form-data, the way a browser posts an HTML form. But for most curl tasks I will often enter the curl command into a web converter.
Check out https://curl.trillworks.com/
Looks like there is a library that does it too, but I've not used it: https://github.com/spulec/uncurl

Mailgun API cURL using HTML - does it accept head and body tags?

I am trying to send an email with the Mailgun API through curl. I have an HTML email, so all I did was change text to html in the sample given. I added my template, but for some reason I noticed it doesn't accept head and body tags, right?
I have the URL with the newsletter, please help me: http://bit.ly/10B63PU
Here in the documentation is the sample code for sending via the API:
http://documentation.mailgun.com/quickstart-sending.html#send-via-api
I learned I had to use --form-string, not just replace text with html. The reason is that -F treats a value starting with < as an instruction to read the field's content from a file of that name, so literal HTML must be passed with --form-string. The example below is directly from Mailgun; before, they only listed the text version for sending mail.
http://documentation.mailgun.com/api-sending.html#examples
curl -s --user 'api:key-3ax6xnjp29jd6fds4gc373sgvjxteol0' \
https://api.mailgun.net/v2/samples.mailgun.org/messages \
-F from='Excited User <me@samples.mailgun.org>' \
-F to='foo@example.com' \
-F cc='bar@example.com' \
-F bcc='baz@example.com' \
-F subject='Hello' \
-F text='Testing some Mailgun awesomness!' \
--form-string html='<html>HTML version of the body</html>' \
-F attachment=@files/cartman.jpg \
-F attachment=@files/cartman.png
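The --form-string distinction matters because -F gives two leading characters special meaning: @ uploads a file, and < reads the field's value from a file. A tiny illustration against httpbin.org (a public echo endpoint, used here only for demonstration):
# Fails: curl interprets the leading '<' as "read the value from this file"
# and tries to open a file named: body>Hello</body>
curl -F html='<body>Hello</body>' https://httpbin.org/post
# Works: --form-string always sends the value verbatim
curl --form-string html='<body>Hello</body>' https://httpbin.org/post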
You need to pass the HTML payload explicitly; otherwise Mailgun will assume the message is plain text.
curl -s --user 'api:key-3ax6xnjp29jd6fds4gc373sgvjxteol0' \
https://api.mailgun.net/v2/samples.mailgun.org/messages \
-F from='Excited User <me@samples.mailgun.org>' \
-F to=baz@example.com \
-F to=bar@example.com \
-F subject='Hello' \
-F text='Plain text version of my awesome newsletter' \
--form-string html='<body>My awesome newsletter</body>'
Thank you for asking the question - it is less obvious than it might seem!

Command Line Tool to Disable PDF Printing

Does anyone know of a free command line tool that can lock a PDF so that a user cannot print it? I need to be able to put this in a batch job that loops through a folder and disables printing, for both Adobe Acrobat Standard and Reader. Is this possible from the command line with any tool?
First, pdftk:
You can use pdftk (available for Linux, Unix, Mac OS X and Windows) to set an "owner password":
pdftk \
input.pdf \
output semi-protected.pdf \
owner_pw "supersecret"
The result looks like this, for example:
pdfinfo semi-protected.pdf | grep Encrypted:
Encrypted: yes (print:no copy:no change:no addNotes:no)
You can modify the command to additionally require a user password to open the PDF:
pdftk \
input.pdf \
output semi-semi-protected.pdf \
owner_pw "supers3cr3t" \
user_pw "s3cr3t"
You can modify the command to (selectively) "allow" other user actions:
pdftk \
input.pdf \
output semi-semi-protected.pdf \
owner_pw "supers3cr3t" \
allow ModifyContents \
allow CopyContents \
allow ScreenReaders \
allow ModifyAnnotations
The result may then look like this, for example:
pdfinfo semi-semi-protected.pdf | grep Encrypted:
Encrypted: yes (print:no copy:yes change:yes addNotes:yes)
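Since the goal was to batch a whole folder, the pdftk call wraps naturally in a shell loop. A minimal sketch, assuming the PDFs sit in the current directory; the protected/ output directory and the password are made-up names:
mkdir -p protected
for f in *.pdf; do
    pdftk "$f" output "protected/$f" owner_pw "supersecret"
done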
Second, podofoencrypt:
Commandline example:
podofoencrypt \
--rc4v2 \
-o "supers3cr3t" \
-u "s3cr3t" \
--edit \
--copy \
--editnotes \
--fillandsign \
--accessible \
--assemble \
input.pdf \
semi-protected.pdf
Big, fat caveat:
You should be aware that this way of 'protecting' PDF files is by no means super-secure. There are quite a lot of PDF cracker utilities out there which can easily un-protect your PDF files. This method is only a very basic means of preventing most novice computer users from messing with your files.
Third, qpdf:
In addition, see also qpdf, covered in Martin Schröder's answer below.
qpdf can do this:
qpdf \
--encrypt \
"user-password" \
"owner-password" \
40 \
--print=n \
-- \
infilename \
outfilename
or even
qpdf \
--encrypt \
"user-password" \
"owner-password" \
128 \
--print=none \
--accessibility=y \
--force-V4 \
--modify=form \
-- \
infilename \
outfilename
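Whichever tool you use, you can check the result; qpdf can also report a file's encryption settings, much like the pdfinfo calls above:
qpdf --show-encryption outfilename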

make my autodownloading shell script better

So I want to download multiple files from RapidShare. This is what I currently have. I created a cookie by running:
wget \
--save-cookies ~/.cookies/rapidshare \
--post-data "login=USERNAME&password=PASSWORD" \
--no-check-certificate \
-O - \
https://ssl.rapidshare.com/cgi-bin/premiumzone.cgi \
> /dev/null
and now I have a shell script that looks like this:
#!/bin/bash
wget -c --load-cookies ~/.cookies/rapidshare http://rapidshare.com/files/219920856/file1.rar
wget -c --load-cookies ~/.cookies/rapidshare http://rapidshare.com/files/393839302/file2.rar
wget -c --load-cookies ~/.cookies/rapidshare http://rapidshare.com/files/398293204/file3.rar
....
I want two things:
The shell script needs to read the files to download from a file.
The shell script should download anywhere from 2 - 8 files at a time.
Thanks!
When you want parallel jobs, think make.
#!/usr/bin/make -f
login:
	wget -qO/dev/null \
	  --save-cookies ~/.cookies/rapidshare \
	  --post-data "login=USERNAME&password=PASSWORD" \
	  --no-check-certificate \
	  https://ssl.rapidshare.com/cgi-bin/premiumzone.cgi
$(MAKEFILES):
%: login
	wget -ca$(addsuffix .log,$(notdir $@)) \
	  --load-cookies ~/.cookies/rapidshare $@
	@echo "Downloaded $@ (log in $(addsuffix .log,$(notdir $@)))"
Save this as rsget somewhere in your $PATH (make sure you use tabs, not spaces, for the indentation), make it executable with chmod +x, and run:
rsget -kj8 \
http://rapidshare.com/files/219920856/file1.rar \
http://rapidshare.com/files/393839302/file2.rar \
http://rapidshare.com/files/398293204/file3.rar \
...
This will log in, then wget each target. -j8 tells make to run up to 8 jobs in parallel, and -k means "keep going even if a target returned failure".
Edit
Tested with GNU Make 3.79 and 3.81.
Try this. I think it should do what you want:
#!/bin/bash
MAX_CONCURRENT=8
URL_BASE="http://rapidshare.com/files/"
cookie_file=~/.cookies/rapidshare
# do your login thing here...
[ -n "$1" -a -f "$1" ] || { echo "please provide a file containing the stuff to download"; exit 1; }
inputfile=$1
count=0
while read -r x; do
    if [ "$count" -ge "$MAX_CONCURRENT" ]; then
        count=0
        wait
    fi
    { wget -c --load-cookies "$cookie_file" "${URL_BASE}$x" && echo "Downloaded $x"; } &
    count=$((count + 1))
done < "$inputfile"
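A usage sketch, assuming the script is saved as rsdl.sh (a made-up name) and the input file lists one path suffix per line, which the script appends to URL_BASE:
$ cat downloads.txt
219920856/file1.rar
393839302/file2.rar
398293204/file3.rar
$ ./rsdl.sh downloads.txt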
