I want to gather a large picture database for an application I'm building. I have seen wget commands for fetching pictures from websites in general, but not for a specific person's name/folder. I was trying to fetch pictures of a specific person from Flickr, like this:
wget -r -A jpeg,jpg,bmp,gif,png https://www.flickr.com/search/?q=obama
It looks as if something is being fetched, with a lot of folders being created, but they are actually empty; no pictures are really being downloaded. Am I doing something wrong?
Does anybody know how to do this, i.e. download a specific person's photos from Google, Flickr, and similar websites using wget?
By default, wget does not --span-hosts. But on Flickr the bitmap files are stored on servers with a different DNS name than www.flickr.com (typically something with "static" in its name).
You may grep for such URLs in the files you retrieved during your first run. Then extend the wget parameters with --span-hosts and a corresponding list of directory names via --include-directories.
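A rough sketch of what that might look like (the staticflickr.com host name is an assumption; check which hosts actually turn up in the HTML from your first run, and note that the search page may build its results with JavaScript, in which case the raw HTML will not contain the image URLs at all):
# First run again, keeping the HTML so there is something to grep:
wget -r -l1 -A html "https://www.flickr.com/search/?q=obama"
grep -rhoE 'https?://[^"/]*static[^"/]*' www.flickr.com/ | sort -u
# Then re-run with host spanning enabled, limited to the hosts that turned up:
wget -r -l2 --span-hosts --domains=flickr.com,staticflickr.com \
     -A jpeg,jpg,bmp,gif,png "https://www.flickr.com/search/?q=obama"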
Another alternative is to follow the approach described at http://blog.fedora-fr.org/shaiton/post/How-to-download-whole-Flickr-album.
Some time ago I worked in a team that developed a number of educational software applications, and they are now being reviewed for bugs and updates.
During this process, I noticed that the "imgs" folder has accumulated too many files. Probably one of the developers decided to dump all the images used by every application into that folder. However, because there are so many applications, it would be too painful to check them all manually (and some of the images are part of the layout, almost invisible).
Is there a way to write a shell script in Linux to check whether the files in a given folder are being used by a set of HTML and JS files in another folder?
Go to the images folder and try this:
for name in *; do grep -ril "$name" /path/to/soft/* || echo "$name not used"; done
I'm not sure I understood your question correctly, but maybe this will help you:
ls -1 your_source_path | while read -r file
do
    grep -wnr "$file" your_destination_path ||
        echo "no match for file $file"
    # you can add any extra action here
done
In your_source_path you put the directory whose file names should be listed, and your_destination_path is the directory that should be searched.
It is not possible to check this for the generic case, since HTML and JavaScript are too dynamic (e.g. the JavaScript code could build the image file name at runtime). Likewise, images can be specified in CSS style sheets, inline styles, etc.
You will want to review the HTML/JS files and see if it is possible to identify the tags that are actually used to specify images. That will hopefully reduce the number of tags and attribute names that need to be extracted.
As an alternative, if you have access to the server's access log, you can find out which images have been accessed over time, and focus the search on the images that never appear in the log file.
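A minimal sketch of that access-log idea (it assumes a combined-format log at the path shown and an image folder called imgs; both are placeholders):
# Image names that appear in the access log
awk '{print $7}' /var/log/apache2/access.log | grep -oE '[^/]+\.(png|jpe?g|gif|svg)' | sort -u > used.txt
# Image names that exist on disk but were never requested
ls /path/to/imgs | sort > all.txt
comm -23 all.txt used.txt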
Project Environment
We are currently developing on Windows 10 with Node.js 10.16.0 and the Express web framework. The actual deployment environment is an Ubuntu Linux server; everything else is the same.
What I want to implement
I want to take the information a user entered when signing up for membership (name, age, address, phone number, etc.) and automatically put it into the input text boxes of a PDF, so that the user only needs to fill in the remaining information. (The PDF is embedded in some of the web pages.)
Once all the information is entered, the PDF is saved and the document is sent to another vendor, and that is the end of the flow.
Current Problems
I spent about four days looking at PDFs, and I tried to create one by hand after studying the outline, structure, and sample code on this site: https://web.archive.org/web/20141010035745/http://gnupdf.org/Introduction_to_PDF
However, most real PDFs are not this simple; their streams are compressed with /FlateDecode. So I also looked at "Data extraction from /Filter /FlateDecode PDF stream in PHP" and tried to decompress my PDF using QPDF.
After decompressing it, I thought it would be easy to spot the difference between the original PDF and a copy where I had typed "Kim" into the first-name field.
However, the diff is huge even though only three characters were added, and the PDF structure itself is too difficult and complex to work with this way.
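For reference, this is roughly what I did with QPDF and diff (the file names here are placeholders):
qpdf --qdf --object-streams=disable blank.pdf blank-qdf.pdf
qpdf --qdf --object-streams=disable filled.pdf filled-qdf.pdf
diff -a blank-qdf.pdf filled-qdf.pdf | head -n 40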
Note: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf (the official PDF specification, in English)
Is there a way to solve the problem now?
It sounds like you want to create a PDF from scratch, and possibly extract data from it, and you are finding this a more difficult prospect than you first imagined.
Check out my answer here on why PDF creation and reading are non-trivial and why you should reach for a tool to help you do this:
https://stackoverflow.com/a/53357682/1669243
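To make "reach for a tool" concrete: if the PDF has real AcroForm form fields, a command-line tool such as pdftk can fill them without you touching the raw PDF syntax. This is only a sketch; the field name first_name and the file names are assumptions, and your PDFs may not contain form fields at all.
data.fdf:
%FDF-1.2
1 0 obj
<< /FDF << /Fields [ << /T (first_name) /V (Kim) >> ] >>
endobj
trailer
<< /Root 1 0 R >>
%%EOF
then:
pdftk blank.pdf fill_form data.fdf output filled.pdf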
I have to extract information such as college name, contact number, email IDs, etc. in a systematic order from thousands of web pages. Is there a script for doing this?
Complete scenario: I downloaded web pages containing individual colleges' info using wget. There are about 5000 web pages, each about a college, containing information about it; however, I am only interested in the name, email IDs, website, and contact numbers. I need the extracted info saved in a suitable file in a systematic order.
How can I extract this info? Can I use grep to do it? Is there a better way? What scripts are available for pulling out this information?
PS: I use Ubuntu and Kali Linux. I am a newbie and need an expert's help.
I assume you have all the files in one directory; "cd" into it and run:
grep -i -e "Name" -e "email" -e "http" ./*
and refine it when you see the result. That will print to your screen; once it looks right, append:
>> my_collected_things.txt
to get it into a file.
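A slightly more structured sketch along the same lines (the regexes, the *.html glob, and the CSV layout are assumptions and will need tuning to the actual pages):
# One CSV line per page: file, first email, first phone-like number, first URL
for f in ./*.html; do
    email=$(grep -oiE '[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}' "$f" | head -n1)
    phone=$(grep -oE '[+0-9][0-9 ()-]{7,}[0-9]' "$f" | head -n1)
    site=$(grep -oiE 'https?://[^"<> ]+' "$f" | head -n1)
    printf '%s,%s,%s,%s\n' "$f" "$email" "$phone" "$site"
done > colleges.csv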
Can I download a website with wget starting from the second depth level and its sublevels?
I want to download only the pages that the first level links to, but not the first level itself.
For example:
I have a domain with this structure: www.domain.cz/foo/bar/baz
I want to download only the pages bar and bar/baz.
Is that possible?
You may download the index file at the top level, then feed it back to wget with -i index.html, optionally setting --base=http://www.domain.cz and directory options like --force-directories. Depending on the content, you might want to trim the index file first (keep only the wanted subdirectories).
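A sketch of that approach, using the domain from the question (the -F/--force-html flag is added here so wget parses index.html as HTML rather than as a plain list of URLs; the depth and directory options are assumptions):
wget -O index.html "http://www.domain.cz/"
wget -r -l1 --force-directories -F --base="http://www.domain.cz/" -i index.html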
Use -np.
-np
--no-parent
Do not ever ascend to the parent directory when retrieving
recursively. This is a useful option, since it guarantees that only
the files below a certain hierarchy will be downloaded.
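For the example in the question, that might look like this (a sketch; it assumes bar is the directory you care about):
wget -r -np -l2 http://www.domain.cz/foo/bar/
With -np, wget stays under /foo/bar/ and never climbs back up to /foo/ or the site root.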
I'm wondering if there's an easy way to download a large number of files of one arbitrary type, e.g., downloading 10,000 XML files. In the past, I've used Bing's API. It's free and offers unlimited queries. However, it doesn't index as many types of files as Google does. Google indexes XML files, CSV files, and KML files. (These can all be found by doing searches like "filetype:XML".) As far as I know, Bing doesn't index these in a way that's easily searchable. Is there another API that has these capabilities?
How about using wget? You can give wget a URL (for example, a Google search result) and tell it to follow all the links on that page and download them (I bet you could also give it a filter).
I just tried it and got "ERROR 403: Forbidden". Apparently Google blocks requests from wget, so you'll have to provide a different user agent. A quick search turned up this example:
http://www.mail-archive.com/wget#sunsite.dk/msg06564.html
With the user agent from that example, it worked.
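Something along these lines (the user-agent string, search URL, and filters are illustrative assumptions, and Google may still rate-limit or block automated requests):
wget -r -l1 -H -A xml --user-agent="Mozilla/5.0 (X11; Linux x86_64)" \
     "https://www.google.com/search?q=filetype:xml+water+quality"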