Image download with Python - python-3.x

I'm trying to download images with Python 3.9.1.
Other than the first 2-3 images, every downloaded image is only 1 KB in size. How do I download all of the pictures properly? Please help me.
Sample book: http://web2.anl.az:81/read/page.php?bibid=568450&pno=1
import urllib.request
import os

bibID = input("ID: ")
first = int(input("First page: "))
last = int(input("Last page: "))

if not os.path.exists(bibID):
    os.makedirs(bibID)

for i in range(first, last + 1):
    url = f"http://web2.anl.az:81/read/img.php?bibid={bibID}&pno={i}"
    urllib.request.urlretrieve(url, f"{bibID}/{i}.jpg")

It doesn't look like there is an issue with your script itself. It has to do with the APIs you are hitting and the sequence of calls they require.
A GET http://web2.anl.az:81/read/img.php?bibid=568450&pno=<page> on its own doesn't work right away; instead, it returns No FILE HERE.
This happens because retrieval of the images is tied to your session cookie. You first need to initiate the read session that is created when you visit the page and click the TƏSDİQLƏYIRƏM button.
From what I could tell you need to do the following:
POST http://web2.anl.az:81/read/page.php?bibid=568450 with a Content-Type: multipart/form-data body containing a single key/value pair, approve: TƏSDİQLƏYIRƏM. This starts a session and generates a cookie for you, which you then have to send with all of your subsequent API calls.
E.g.
requests.post('http://web2.anl.az:81/read/page.php?bibid=568450', files=dict(approve='TƏSDİQLƏYIRƏM'))
Then do the following in your for-loop over pages (a full sketch follows after these steps):
a. GET http://web2.anl.az:81/read/page.php?bibid=568450&pno=<page number> - page won't show up if you don't do this first
b. GET http://web2.anl.az:81/read/img.php?bibid=568450&pno=<page number> - finally get the image!
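Putting those steps together, here is a minimal sketch using a requests.Session so the cookie is carried automatically between calls. The page range below is illustrative, and the exact flow is an assumption based on the observations above:
import os
import requests

bib_id = "568450"      # example book ID from the question
first, last = 1, 10    # illustrative page range

session = requests.Session()  # keeps the session cookie across requests

# Start the read session (equivalent to clicking the TƏSDİQLƏYIRƏM button)
session.post(
    f"http://web2.anl.az:81/read/page.php?bibid={bib_id}",
    files=dict(approve="TƏSDİQLƏYIRƏM"),
)

os.makedirs(bib_id, exist_ok=True)
for page in range(first, last + 1):
    # Request the page first; the image endpoint only works after this call
    session.get(f"http://web2.anl.az:81/read/page.php?bibid={bib_id}&pno={page}")
    # Now fetch the actual image and save it
    img = session.get(f"http://web2.anl.az:81/read/img.php?bibid={bib_id}&pno={page}")
    with open(f"{bib_id}/{page}.jpg", "wb") as f:
        f.write(img.content)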


How to open "partial" links using Python?

I'm working on a webscraper that opens a webpage, and prints any links within that webpage if the link contains a keyword (I will later open these links for further scraping).
For example, I am using the requests module to open "cnn.com", and then trying to parse out all href/links within that webpage. Then, if any of the links contain a specific word (such as "china"), Python should print that link.
I could just open the main page using requests, save all hrefs into a list ('links'), and then use:
links = [...]
keyword = "china"

for link in links:
    if keyword in link:
        print(link)
However, the problem with this method is that the links I originally parsed out aren't full links. For example, all links on CNBC's webpage are structured like this:
href="https://www.cnbc.com/2019/08/11/how-recession-affects-tech-industry.html"
But for CNN's page, they're written like this (not full links... they're missing the part that comes before the "/"):
href="/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
This is a problem because I'm writing additional script to automatically open these links and parse them, but Python can't open
"/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
because it isn't a full link.
So, what is a robust solution to this (something that works for other sites too, not just CNN)?
EDIT: I know the links I wrote as examples in this post don't contain the word "China", but these are just examples.
Try using the urljoin function from the urllib.parse package. It takes two parameters: the first is the URL of the page you're currently parsing, which serves as the base for relative links; the second is the link you found. If the link you found starts with http:// or https://, urljoin returns just that link; otherwise it resolves the URL relative to what you passed as the first parameter.
So for example:
#!/usr/bin/env python3
from urllib.parse import urljoin

print(
    urljoin(
        "https://www.cnbc.com/",
        "/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
    )
)
# prints "https://www.cnbc.com/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"

print(
    urljoin(
        "https://www.cnbc.com/",
        "http://some-other.website/"
    )
)
# prints "http://some-other.website/"

Save images from Subreddit to folder python

I have been reading through a lot of documentation around PRAW and bs4, and I've had a look at other people's examples of how to do this, but I just can't get anything working the way I would like. I thought it would be a pretty simple script, but every example I find is either written in Python 2 or just doesn't work at all.
I would like a script to download the top 10 images from a given Subreddit and save them to a folder.
If anyone could point me in the right direction that would be great.
Cheers
The high level flow will look something like this -
Iterate through the top posts of your subreddit.
Extract the url of the submission.
Check if the url is an image.
Save the image to your desired folder.
Stop once you have 10 images.
Here's an example of how this could be implemented:
import urllib.request
import praw

# "reddit" is assumed to be an authenticated praw.Reddit instance,
# e.g. praw.Reddit(client_id=..., client_secret=..., user_agent=...)
subreddit = reddit.subreddit("aww")
count = 0

# Iterate through top submissions
for submission in subreddit.top(limit=None):
    # Get the link of the submission
    url = str(submission.url)
    # Check if the link is an image
    if url.endswith("jpg") or url.endswith("jpeg") or url.endswith("png"):
        # Retrieve the image and save it in the current folder
        urllib.request.urlretrieve(url, f"image{count}")
        count += 1
    # Stop once you have 10 images
    if count == 10:
        break
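Since the question asks for the images to go into a specific folder, one small variation would be to create the directory first and join it into the target path (the folder name here is arbitrary):
import os

folder = "reddit_images"  # arbitrary folder name
os.makedirs(folder, exist_ok=True)
# inside the loop, save into that folder instead of the working directory
urllib.request.urlretrieve(url, os.path.join(folder, f"image{count}.jpg"))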

Gigya API can get hidden comments but not the visible one

I ran into a very weird problem when trying to parse data from a JavaScript-rendered website. Maybe it's because I am not an expert in web development.
Here is what happened:
I am trying to get all the comments data from The Globe and Mail. If you check its source code, there is no way to use Python to parse the comments data from the source code; everything is rendered by JavaScript.
However, there is a magic tool, the "Gigya" API, which can return all the comments from a JS-rendered website: the Gigya getComments method.
When I was using these lines of code in Python Scrapy Spider, it could return all the comments.
from urllib.request import urlopen
from urllib.parse import urlencode
from json import loads

data = {"categoryID": self.categoryID,
        "streamID": streamId,
        "APIKey": self.apikey,
        "callback": "foo",
        "threadLimit": 1000  # assume no article has more than 1000 comments
        }
r = urlopen("http://comments.us1.gigya.com/comments.getComments",
            data=urlencode(data).encode("utf-8"))
comments_lst = loads(r.read().decode("utf-8"))["comments"]
However, The Globe and Mail is updating its website, and all the comments posted before Nov. 28 have been hidden from the web for now. That's why, on the sample URL I am showing here, you can only see 2 comments: they were posted after Nov. 28. These 2 new comments also have the new feature, the "React" button.
The weird thing is, right now when I run my code, I can get all those hundreds of hidden comments published before Nov. 28, but I cannot get the new comments we can see on the website now.
I have tried all the Gigya comment-related methods and none of them worked; the other Gigya methods don't look helpful either...
Is there any way to solve this problem?
Or at least, do you know why I can get all the hidden comments but cannot get the visible new comments that have the new features?
Finally, I solved the problem with the Python selenium library; it's free and it's super cool.
So it seems that although we cannot see the content in the source code of a JS-rendered website, there is in fact an HTML page from which we can parse the content.
First of all, I installed Firebug on Firefox. With this add-on, I'm able to see the HTML page of the URL, and it makes it very easy to locate the content: just search for keywords in Firebug.
Then I wrote the code like this:
from selenium import webdriver
import time

def main():
    comment_urls = [
        "http://www.theglobeandmail.com/opinion/a-fascists-win-americas-moral-loss/article32753320/comments/"
    ]
    for comment_url in comment_urls:
        driver = webdriver.Firefox()
        driver.get(comment_url)
        time.sleep(5)
        htmlSource = driver.page_source
        clk = driver.find_element_by_css_selector('div.c3qHyJD')
        clk.click()
        reaction_counts = driver.find_elements_by_class_name('c2oytXt')
        for rc in reaction_counts:
            print(rc.text)

if __name__ == "__main__":
    main()
The data I am parsing here is content that cannot be found in the HTML page until you click the reaction image on the website. What makes selenium super cool is its click() method. Once you have found the element you want to click, just use this method, and the generated elements will appear in the HTML and become parsable. Super cool!

python 3 requests try-except failure

I have a program which downloads pages from a site, finds links to pictures in them, and downloads those pictures. If I run this program on a computer with a fast and stable Internet connection, everything works perfectly for days and weeks. But if I run it on a computer with a slow or unstable Internet connection, I have one problem: the try-except block doesn't seem to work correctly.
This function downloads content - any content (page or picture):
def downl(self, addr, cook, head2, errmess):
    global result
    try:
        result = requests.get(addr, cookies=cook, headers=head2)
    except:
        print(errmess)  # error message
        time.sleep(5)
    return result
I send this function a link to the page, then another function looks for picture_link in that page, and then I send picture_link to the same function (downl). After this I save the result of downl as a .jpg file. As I said, on a computer with a normal Internet connection everything works fine; as a result I get 5, 10 or 5000 pictures on my HDD.
But let me show a little example of what happens with a bad Internet connection. Suppose we have 2 pages and 1 picture on every page.
step 1) downloading 1st page (def downl)
step 2) taking picture_link from it
step 3) downloading picture (def downl)
step 4) saving 1st picture to hdd 1.jpg
step 5) downloading 2nd page (def downl)
step 6) taking picture_link from it
step 7) downloading picture (def downl) and receiving error message (errmess)
step 8) saving 2nd picture to hdd 2.jpg
Just for example: the 1st picture may be a normal jpg with proper content, while the second "picture" will be a file with a jpg extension but with the 2nd page as its content (a plain HTML file saved with the wrong extension, "jpg").
In other words: there was a problem with the Internet while downloading the second picture; the program printed an error about it (errmess), but INSTEAD of retrying endlessly (as my function is supposed to) it somehow PASSED through the try-except block and returned the previous result (the 2nd page), which was saved as the 2nd picture.
Please help! How do I make this try-except (or requests) retry FOREVER, UNTIL it downloads what it is supposed to download (no matter what goes wrong with the Internet connection), instead of passing through with the previous result?
Thank you very much for your time and attention.
Then you need a while True loop like this:
def downl(self, addr, cook, head2, errmess):
    global result
    while True:
        try:
            result = requests.get(addr, cookies=cook, headers=head2)
            return result
        except:
            print(errmess)  # error message
            time.sleep(5)
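A possible refinement, not from the original answer: dropping the global, adding a timeout so a stalled connection raises instead of hanging, and treating HTTP error codes as failures as well. A minimal sketch:
import time
import requests

def downl(addr, cook, head2, errmess, timeout=30):
    # Keep retrying until the request succeeds, then return the response
    while True:
        try:
            result = requests.get(addr, cookies=cook, headers=head2, timeout=timeout)
            result.raise_for_status()  # treat 4xx/5xx responses as failures too
            return result
        except requests.RequestException:
            print(errmess)  # error message
            time.sleep(5)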

how to verify links in a PDF file

I have a PDF file and I want to verify whether its links are proper. Proper in the sense that all specified URLs are linked to web pages and nothing is broken. I am looking for a simple utility or a script which can do this easily.
Example:
$ testlinks my.pdf
There are 2348 links in this pdf.
2322 links are proper.
Remaining broken links and page numbers in which it appears are logged in brokenlinks.txt
I have no idea whether something like that exists, so I googled and searched on Stack Overflow as well, but did not find anything useful yet. I would like to know if anyone has any idea about it!
Updated: to make the question clear.
You can use pdf-link-checker
pdf-link-checker is a simple tool that parses a PDF document and checks for broken hyperlinks. It does this by sending simple HTTP requests to each link found in a given document.
To install it with pip:
pip install pdf-link-checker
Unfortunately, one dependency (pdfminer) is broken. To fix it:
pip uninstall pdfminer
pip install pdfminer==20110515
I suggest first using the Linux command-line utility pdftotext; you can find its man page here:
pdftotext man page
The utility is part of the Xpdf collection of PDF processing tools, available on most linux distributions. See http://foolabs.com/xpdf/download.html.
Once installed, you could process the PDF file through pdftotext:
pdftotext file.pdf file.txt
Once processed, you can write a simple Perl script that searches the resulting text file for http URLs and retrieves them using LWP::Simple. LWP::Simple->get('http://...') lets you validate the URLs with a code snippet such as:
use LWP::Simple;
$content = get("http://www.sn.no/");
die "Couldn't get it!" unless defined $content;
That would accomplish what you want to do, I think. There are plenty of resources on how to write regular expressions to match http URLs, but a very simple one would look like this:
m/http[^\s]+/i
"http followed by one or more not-space characters" - assuming the URLs are property URL encoded.
There are two lines of enquiry with your question.
Are you looking for regex verification that the link contains key information such as http:// and valid TLD codes? If so, I'm sure a regex expert will drop by, or you can have a look at regexlib.com, which contains lots of existing regexes for dealing with URLs.
Or, if you want to verify that a website exists, then I would recommend Python + Requests, since you can script out checks to see whether websites exist and don't return error codes (a sketch of such a check follows below).
It's a task I'm currently undertaking for pretty much the same purpose at work; we have about 54k links to process automatically.
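For the Requests route, a minimal sketch of such a check (the helper name and timeout value are illustrative, not from the original answer):
import requests

def url_is_alive(url, timeout=10):
    # Returns True if the URL responds without an HTTP error code
    try:
        response = requests.get(url, timeout=timeout, allow_redirects=True)
        return response.status_code < 400
    except requests.RequestException:
        return False

print(url_is_alive("https://www.example.com/"))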
Collect links by:
enumerating links using an API, dumping the document as text and linkifying the result, or saving it as HTML with PDFMiner.
Make requests to check them:
there are a plethora of options depending on your needs.
The advice in https://stackoverflow.com/a/42178474/1587329 was the inspiration for writing this simple tool (see gist):
'''loads the pdf file in sys.argv[1], extracts URLs, tries to load each URL'''
import sys
import urllib.request

import PyPDF2

# credits to stackoverflow.com/questions/27744210


def extract_urls(filename):
    '''extracts all urls from filename'''
    PDFFile = open(filename, 'rb')
    PDF = PyPDF2.PdfFileReader(PDFFile)
    pages = PDF.getNumPages()

    key = '/Annots'
    uri = '/URI'
    ank = '/A'

    for page in range(pages):
        pageSliced = PDF.getPage(page)
        pageObject = pageSliced.getObject()
        if key in pageObject:
            ann = pageObject[key]
            for a in ann:
                u = a.getObject()
                if uri in u[ank]:
                    yield u[ank][uri]


def check_http_url(url):
    '''raises an error if the URL cannot be loaded'''
    urllib.request.urlopen(url)


if __name__ == "__main__":
    for url in extract_urls(sys.argv[1]):
        check_http_url(url)
Save to filename.py, run as python filename.py pdfname.pdf.
