Wget - download website from second depth - linux

Can I download a website with wget starting from the second depth level and its sublevels?
I want to download only the pages that are linked from the first level, but not the first level itself.
For example:
I have a domain with the structure: www.domain.cz/foo/bar/baz
I want to download only the pages bar and bar/baz.
Is this possible?

You may download the index file on the top level, then feed it to wget with -i index.html, optionally setting --base=http://www.domain.cz and directory options like --force-directories. Depending on the content there, you might want to trim the index file down to only the wanted subdirectories.
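A minimal sketch of that two-step approach, using the paths from the question (the exact flags may need tuning for the real site):
wget -O index.html http://www.domain.cz/foo/
# optionally edit index.html so it only contains the links you want, then let wget
# read it as an HTML link list (-F) and resolve relative links against --base:
wget -r -np -F -i index.html --base=http://www.domain.cz/foo/ --force-directories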

Use -np.
-np
--no-parent
Do not ever ascend to the parent directory when retrieving
recursively. This is a useful option, since it guarantees that only
the files below a certain hierarchy will be downloaded.
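For the structure from the question, a minimal invocation might look like this (assuming bar/ is the level you want the recursion to start from):
# start the recursion at bar; --no-parent keeps wget from climbing back up to foo
wget -r --no-parent http://www.domain.cz/foo/bar/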

Related

shell script to check if images in a folder are being used by a set of HTML files

Some time ago I worked on a team that developed a set of educational software packages, and now they are being reviewed for bugs and updates.
During this process, I noticed that the folder "imgs" has accumulated too many files. Probably one of the developers decided to include all the images used by each of the programs in that folder. However, because there are so many programs, it would be too painful to check all of them manually (and some of the images are part of the layout, almost invisible).
Is there a way to write a shell script in Linux to check whether the files in a given folder are being used by a set of HTML and JS files in another folder?
Go to the images folder and try this:
for name in *; { grep -ril "$name" /path/to/soft/* || echo "$name not used"; }
I'm not sure I understood your question correctly, but maybe this will help you:
ls -1 your_source_path | while read -r file
do
    grep -wnr "$file" your_destination_path ||
    echo "no matching for file $file"
    # you can set any extra action here
done
In your_source_path you put the directory whose file names will be listed; your_destination_path is where the script should search for them.
It is not possible to check the generic case, since HTML and JavaScript are too dynamic (e.g. the JavaScript code could construct the image file name on the fly). Likewise, images can be specified in CSS style sheets, inline styles, etc.
You will want to review the HTML/JS files and see if it is possible to identify the tags that are actually used to specify images. This will hopefully reduce the number of tags and attribute names that need to be extracted.
As an alternative, if you have access to the server's access log, you can find out which images have been accessed over time, and focus the search on images not referenced in the log file.
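A rough sketch of that last idea, assuming a standard Apache-style access log and that the images live in a folder called imgs (both the log path and the folder name are assumptions):
# report every image whose file name never appears in the access log
for f in imgs/*; do
    name=$(basename "$f")
    grep -q "$name" /var/log/apache2/access.log || echo "$name never requested"
done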

fetching a specific category / specific person photos using wget

I want to gather a large picture database for running an application. I have seen wget commands for fetching pictures from websites in general, but not for a specific person's name/folder. I was trying to fetch pictures of a specific person from Flickr, like this:
wget -r -A jpeg,jpg,bmp,gif,png https://www.flickr.com/search/?q=obama
Though it looks as if something is being fetched, with a lot of folders being created, they are actually empty; no pictures are really being downloaded. Am I doing something wrong?
Does anybody know how to do this, i.e. downloading a specific person's photos from Google and Flickr type websites using wget?
By default, wget does not --span-hosts, but on Flickr the bitmap files are stored on servers with a different DNS name than www.flickr.com (typically something with "static" in its name).
You can grep for such URLs in the files you retrieved during your first run. Then extend the wget parameters with --span-hosts and a corresponding list of directory names via --include-directories.
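A hedged sketch of that second run; the directory list below is only a placeholder, and the real values should come from grepping the HTML downloaded during the first run:
# --span-hosts lets wget leave www.flickr.com; --include-directories limits where it may go
wget -r -l 2 -A jpeg,jpg,bmp,gif,png --span-hosts \
     --include-directories=/search,/photos \
     "https://www.flickr.com/search/?q=obama"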
Another alternative is to follow the approach described at http://blog.fedora-fr.org/shaiton/post/How-to-download-whole-Flickr-album.

Add menu item in XAMPP

I want to add a menu item to my XAMPP home page, right in the Demos category. I can see there are multiple items there like CD Collection, Biorhythm, Guest Book etc. I want to add my folder "Courses01" right before CD Collection.
How can I do that?
Here's where I want to place it.
I forgot to mention: I saw this option in WAMP, where I had a directory in the /www folder and I could browse my directory from the browser, just by clicking on the links. I want something similar.
In order to display the folders where you code, all you need to do is:
Stop the servers (Apache, MySQL etc.)
Go to your XAMPP directory (where you've installed it)
Rename the folder htdocs to htdocs_default (that's so you know which one is the original)
Create a new folder called htdocs
Store your PHP scripts or folders there and have fun!
NOTE: To access your work path in the browser, you just type localhost (as usual). The same steps from the command line are sketched below.
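On a Linux XAMPP install, that would look roughly like this (the /opt/lampp path is an assumption; adjust it to wherever XAMPP is installed):
sudo /opt/lampp/lampp stop          # stop Apache, MySQL, ...
cd /opt/lampp
sudo mv htdocs htdocs_default       # keep the original document root around
sudo mkdir htdocs                   # new, empty document root
# copy your PHP scripts/folders into the new htdocs, then:
sudo /opt/lampp/lampp start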
Maybe it is too late, but...
You can add your menu item by modifying 2 files:
First, add your item to the file named [your_language].php. You will find this file in the directory [xampp_dir]/htdocs/lang/.
In that file (en.php for example), add the navi entry you want.
For example, in your case:
$TEXT['navi-course01'] = "Course01";
after the line $TEXT['navi-phpswitch'] = "PHP Switch";
Save your file.
Secondly, you have to modify the file phpexamples.php in the directory:
[xampp_dir]/htdocs/xampp/navilinks
Add your menu item, for example:
<a class="n" target="content" onclick="h(this);" href="external/mycourse01dir/index.php"><?php echo $TEXT['navi-course01']; ?></a><br>
Save the file and reload your menu page, e.g. http://localhost
If you want this modification in all the languages of your XAMPP installation, you have to modify all the files in the lang directory.
The menu items are shown in the same order they appear in the phpexamples.php file.
Regards

How to retrieve a file from a folder on another subdomain without using the full URL

I have a file "area1.mysite.com/gallery/settings.php" that I need to include in "area2.mysite.com/index.php". The issue is that I cannot use the full URL; I need to go backwards from /area2/www/index.php to /area1/www/gallery/settings.php... Does that make sense?
Surely this could be done using relative links? So you would include:
../../../area1/www/gallery/settings.php
As long as that matches the file setup in your question... But yeah, basically each ../ moves you up one folder, and then you can dive back down just like you would with a non-relative link.
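A quick way to sanity-check how many ../ you actually need (the layout below is purely illustrative, not taken from the question):
#   /srv/sites/area1/www/gallery/settings.php
#   /srv/sites/area2/www/index.php
cd /srv/sites/area2/www
realpath ../../area1/www/gallery/settings.php
# if the printed path is the settings.php you expect, use that same relative path
# in your include; the number of ../ depends entirely on the real directory layout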

What are some good crawlers that can help download files

For one of my statistics projects, I need to RANDOMLY download several files from a Google patents page; each file is a large zip file. The web link is the following:
http://www.google.com/googlebooks/uspto-patents-grants-text.html#2012
Specifically, I want to RANDOMLY select 5 years (the links at the top of the page) and download them (i.e. 5 files). Do you know of a good package out there for this task?
Thank you.
That page contains mostly zip files, and looking at the HTML content it seems that it should be fairly easy to determine which links will yield a zip file by simply searching for ".zip" in the candidate URLs, so here is what I would recommend:
fetch the page
parse the HTML
extract the anchor tags
for each anchor tag
    if href of anchor tag contains ".zip"
        add href to list of file links
while more files needed
    generate a random index i, such that i is between 0 and num links in list
    select the i-th element from the links list
    fetch the zip file
    save the file to disk or load it in memory
If you don't want to get the same file twice, then just remove the URL from the list of links and randomly select another index (until you have enough files or until you run out of links). I don't know what programming language your team codes in, but it shouldn't be very difficult to write a small program that does the above.
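If shell tools are an option, a minimal sketch of the steps above could look like this (it assumes the links on the page are absolute URLs ending in .zip; if they turn out to be relative, the base URL would need to be prepended):
# grab the page, pull out every href that ends in .zip, pick 5 at random, download them
wget -q -O page.html "http://www.google.com/googlebooks/uspto-patents-grants-text.html"
grep -oE 'href="[^"]+\.zip"' page.html | sed 's/^href="//; s/"$//' > links.txt
shuf -n 5 links.txt | while read -r link; do
    wget "$link"
done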
