I'm working on a webscraper that opens a webpage, and prints any links within that webpage if the link contains a keyword (I will later open these links for further scraping).
For example, I am using the requests module to open "cnn.com", and then trying to parse out all href/links within that webpage. Then, if any of the links contain a specific word (such as "china"), Python should print that link.
I could just simply open the main page using requests, save all href's onto a list ('links'), and then use:
links = [...]
keyword = "china"
for link in links:
if keyword in link:
print(link)
However, the problem with this method is that the links that I originally parsed out aren't full links. For example, all links with CNBC's webpage are structured like this:
href="https://www.cnbc.com/2019/08/11/how-recession-affects-tech-industry.html"
But for CNN's page, they're written like this (not full links... they're missing the part that comes before the "/"):
href="/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
This is a problem because I'm writing more script to automatically open these links to parse them. But Python can't open
"/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
because it isn't a full link.
So, what is a robust solution to this (something that works for other sites too, not just CNN)?
EDIT: I know the links I wrote as an example in this post don't contain the word "China", but this these are just examples.
Try using the urljoin function from the urllib.parse package. It takes two parameters, the first is the URL of the page you're currently parsing, which serves as the base for relative links, the second is the link you found. If the link you found starts with http:// or https://, it'll return just that link, else it will resolve URL relative to what you passed as the first parameter.
So for example:
#!/usr/bin/env python3
from urllib.parse import urljoin
print(
urljoin(
"https://www.cnbc.com/",
"/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
)
)
# prints "https://www.cnbc.com/2019/08/10/europe/luxembourg-france-amsterdam-tornado-intl/index.html"
print(
urljoin(
"https://www.cnbc.com/",
"http://some-other.website/"
)
)
# prints "http://some-other.website/"
Related
I'm trying to make a Python script that will retrieve PhysRevPER articles for me in html format so that I can put them on my Kindle and read them away from my desk (and internet connection). However, while the full text of the articles can be seen by going to the website, one has to click on several "Click to Expand" (or simply "+") "links" in order to see it all. These "links" don't point to new urls, so I assume they are controlling some sort of script which determines the visibility of each section of the article. Is there some way to instruct urllib to send the appropriate script instructions that will cause the "links" to expand and download the page after everything is expanded? When I simply try to retrieve the page I don't get any of the hidden text.
Here's a link to a recent article:
https://journals.aps.org/prper/abstract/10.1103/PhysRevPhysEducRes.15.010134
My stub script (takes the DOI information, 10.1103/PhysRevPhysEducRes.15.010134 for the above example, as first argument and, optionally, the name of the output file as the second):
#! /usr/bin/env python3
import urllib.request
import sys
import io.open
print('retreive url and convert to text')
url = "https://journals.aps.org/prper/abstract/" + sys.argv[1]
codec = "utf_8"
data = urllib.request.urlopen(url)
doc = data.read().decode(codec)
if len(sys.argv) == 3:
filename = sys.argv[2]
else
filename = "default.html"
a = io.open(filename, mode='wt', encoding='utf-8')
for i in range(len(information)):
information[i] = '%s\n' % information[i]
a.writelines(information)
a.close()
I was following a web scraping tutorial from here: http://docs.python-guide.org/en/latest/scenarios/scrape/
It looks pretty straight forward and before I did anything else, I just wanted to see if the sample code would run. I'm trying to find the URIs for the images on this site.
http://www.bvmjets.com/
This actually might be a really bad example. I was trying to do this with a more complex site but decided to dumb it down a bit so I could understand what was going on.
Following the instructions, I got the XPath for one of the images.
/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img
The whole script looks like:
from lxml import html
import requests
page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
images = tree.xpath('/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img')
print(images)
But when I run this, the dict is empty. I've looked at the XPath docs and I've tried various alterations to the xpath but I get nothing each time.
I dont think I can answer you question directly, but I noticed the images on the page you are targeting are sometimes wrapped differently. I'm unfamiliar with xpath myself, and wasnt able to get the number selector to work, despite this post. Here are a couple of examples to try:
tree.xpath('//html//body//div//div//table//tr//td//div//a//img[#src]')
or
tree.xpath('//table//tr//td//div//img[#src]')
or
tree.xpath('//img[#src]') # 68 images
The key to this is building up slowly. Find all the images, then find the image wrapped in the tag you are interested in.. etc etc, until you are confident you can find only the images your are interested in.
Note that the [#src] allows us to now access the source of that image. Using this post we can now download any/all image we want:
import shutil
from lxml import html
import requests
page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
cool_images = tree.xpath('//a[#target=\'_blank\']//img[#src]')
source_url = page.url + cool_images[5].attrib['src']
path = 'cool_plane_image.jpg' # path on disk
r = requests.get(source_url, stream=True)
if r.status_code == 200:
with open(path, 'wb') as f:
r.raw.decode_content = True
shutil.copyfileobj(r.raw, f)
I would highly recommend looking at Beautiful Soup. For me, this has helped my amateur web scraping ventures. Have a look at this post for a relevant starting point.
This may not be the answer you are looking for, but hopeful it is a starting point / of some use to you - best of luck!
This is my first attempt to use web scraping in python to extract some links from a webpage.
This the webpage i am interested in getting some data from:
http://www.hotstar.com/tv/bhojo-gobindo/14172/seasons/season-5
I am interest in extracting all the instance of following from above webpage:
href="/tv/bhojo-gobindo/14172/gobinda-is-in-a-fix/1000196352"
I have written following regex to extract all the matches of above type of links:
r"href=\"(\/tv\/bhojo-gobindo\/14172\/.*\/\d{10})\""
Here is quick code i have written to try to extract all the regex mataches:
#!/usr/bin/python3
import re
import requests
url = "http://www.hotstar.com/tv/bhojo-gobindo/14172/seasons/season-5"
page = requests.get(url)
l = re.findall(r'href=\"(\/tv\/bhojo-gobindo\/14172\/.*\/\d{10})\"', page.text)
print(l)
When I run the above code I get following ouput:
./links2.py
[]
When I inspect the webpage using developer tools within the browser I can see this links but when I try to extract the text I am interested in(href="/tv/bhojo-gobindo/14172/gobinda-is-in-a-fix/1000196352") using python3 script I get no matches.
Am I downloading the webpage correctly, how do I make sure I am getting all of the webapage from within my script. i have a feeling I am missing parts of the web page when using the requests to get the web page.
Any help please.
I met a very weird problem when I am trying to parse data from a JavaScript written website. Maybe because I am not an expert of web development.
Here is what happened:
I am trying to get all the comments data from The Globe and Mail. If you check its source code, there is no way to use Python and parse the comments data from the source code, everything is written in JavaScript.
However, there is a magic tool called "Gigya" API, it could return all the comments from a JS written website. Gigya getComments method
When I was using these lines of code in Python Scrapy Spider, it could return all the comments.
data = {"categoryID": self.categoryID,
"streamID": streamId,
"APIKey": self.apikey,
"callback": "foo",
"threadLimit": 1000 # assume all the articles have no more then 1000 comments
}
r = urlopen("http://comments.us1.gigya.com/comments.getComments", data=urlencode(data).encode("utf-8"))
comments_lst = loads(r.read().decode("utf-8"))["comments"]
However, The Globe and Mail is updating their website, all the comments posted before Nov. 28 have been hiden from the web for now. That's why on the sample url I am shoing here, you could only see 2 comments, because they were posted after Nov. 28. And these 2 new comments have been added the new feature - the "React" button.
The weird thing is, righ now when I am running my code, I can get all those hidden hundreds of comments published before Nov. 28, but cannot get the new commnets we can see on the website now.
I have tried all the Gigya comment related methods, none of them worked, other Gigya methods, do not look like something helpful...
Is there any way to solve this problem?
Or at least, do you know why, I can get all the hiden comments but cannot get visible new commnets that have the new features?
Finally, I solved the problem with Python selenium library, it's free and it's super cool.
So, it seems although in the source code of JS written website, we could not see the content, it's in fact has HTML page where we can parse the content.
First of all, I installed Firebug on Firefox, with this Addon, I'm able to see the HTML page of the url and it's very easy to help you locate the content, just search key words in Firebug
Then I wrote the code like this:
from selenium import webdriver
import time
def main():
comment_urls = [
"http://www.theglobeandmail.com/opinion/a-fascists-win-americas-moral-loss/article32753320/comments/"
]
for comment_url in comment_urls:
driver = webdriver.Firefox()
driver.get(comment_url)
time.sleep(5)
htmlSource = driver.page_source
clk = driver.find_element_by_css_selector('div.c3qHyJD')
clk.click()
reaction_counts = driver.find_elements_by_class_name('c2oytXt')
for rc in reaction_counts:
print(rc.text)
if __name__ == "__main__":
main()
The data I am parsing here are those content cannot be found in HTML page until you click the reaction image on the website. What makes selenium super cool is that click() method. After you found the element you can click, just use this method, then those generated elements will appear in the HTML and become parsable. Super cool!
I have a PDF file which I want to verify whether the links in that are proper. Proper in the sense - all URLs specified are linked to web pages and nothing is broken. I am looking for a simple utility or a script which can do it easily ?!
Example:
$ testlinks my.pdf
There are 2348 links in this pdf.
2322 links are proper.
Remaining broken links and page numbers in which it appears are logged in brokenlinks.txt
I have no idea of whether something like that exists, so googled & searched in stackoverflow also. But did not find anything useful yet. So would like to anyone has any idea about it !
Updated: to make the question clear.
You can use pdf-link-checker
pdf-link-checker is a simple tool that parses a PDF document and checks for broken hyperlinks. It does this by sending simple HTTP requests to each link found in a given document.
To install it with pip:
pip install pdf-link-checker
Unfortunately, one dependency (pdfminer) is broken. To fix it:
pip uninstall pdfminer
pip install pdfminer==20110515
I suggest first using the linux command line utility 'pdftotext' - you can find the man page:
pdftotext man page
The utility is part of the Xpdf collection of PDF processing tools, available on most linux distributions. See http://foolabs.com/xpdf/download.html.
Once installed, you could process the PDF file through pdftotext:
pdftotext file.pdf file.txt
Once processed, a simple perl script that searched the resulting text file for http URLs, and retrieved them using LWP::Simple. LWP::Simple->get('http://...') will allow you to validate the URLs with a code snippet such as:
use LWP::Simple;
$content = get("http://www.sn.no/");
die "Couldn't get it!" unless defined $content;
That would accomplish what you want to do, I think. There are plenty of resources on how to write regular expressions to match http URLs, but a very simple one would look like this:
m/http[^\s]+/i
"http followed by one or more not-space characters" - assuming the URLs are property URL encoded.
There are two lines of enquiry with your question.
Are you looking for regex verification that the link contains key information such as http:// and valid TLD codes? If so I'm sure a regex expert will drop by, or have a look at regexlib.com which contains lots of existing regex for dealing with URLs.
Or are you wanting to verify that a website exists then I would recommend Python + Requests as you could script out checks to see if websites exist and don't return error codes.
It's a task which I'm currently undertaking for pretty much the same purpose at work. We have about 54k links to get processed automatically.
Collect links by:
enumerating links using API, or dumping as text and linkifying the result, or saving as html PDFMiner.
Make requests to check them:
there are plethora of options depending on your needs.
https://stackoverflow.com/a/42178474/1587329's advice was inspiration to write this simple tool (see gist):
'''loads pdf file in sys.argv[1], extracts URLs, tries to load each URL'''
import urllib
import sys
import PyPDF2
# credits to stackoverflow.com/questions/27744210
def extract_urls(filename):
'''extracts all urls from filename'''
PDFFile = open(filename,'rb')
PDF = PyPDF2.PdfFileReader(PDFFile)
pages = PDF.getNumPages()
key = '/Annots'
uri = '/URI'
ank = '/A'
for page in range(pages):
pageSliced = PDF.getPage(page)
pageObject = pageSliced.getObject()
if pageObject.has_key(key):
ann = pageObject[key]
for a in ann:
u = a.getObject()
if u[ank].has_key(uri):
yield u[ank][uri]
def check_http_url(url):
urllib.urlopen(url)
if __name__ == "__main__":
for url in extract_urls(sys.argv[1]):
check_http_url(url)
Save to filename.py, run as python filename.py pdfname.pdf.