Checking if a robots.txt file exists in Python 3

I want to check a URL for the existence of a robots.txt file. I found out about urllib.robotparser in Python 3 and tried getting the response, but I can't find a way to return the status code (or just True/False existence) of robots.txt.
from urllib import parse
from urllib import robotparser

def get_url_status_code():
    URL_BASE = 'https://google.com/'
    parser = robotparser.RobotFileParser()
    parser.set_url(parse.urljoin(URL_BASE, 'robots.txt'))
    parser.read()
    # I want to return the status code

print(get_url_status_code())

This isn't too hard to do if you're okay with using the requests module, which is highly recommended:
import requests

def status_code(url):
    r = requests.get(url)
    return r.status_code

print(status_code('https://github.com/robots.txt'))
print(status_code('https://doesnotexist.com/robots.txt'))
Otherwise, if you want to avoid using a GET request, you could use a HEAD request.
def does_url_exist(url):
    return requests.head(url).status_code < 400
Better yet,
def does_url_exist(url):
    try:
        r = requests.head(url)
        if r.status_code < 400:
            return True
        else:
            return False
    except requests.exceptions.RequestException as e:
        print(e)
        # handle your exception
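For reference, here is a minimal sketch using only the standard-library modules the question already imports, assuming a HEAD request is acceptable and that a 4xx/5xx reply should surface its code rather than raise:

from urllib import error, parse, request

def robots_txt_status(url_base):
    """Return the HTTP status code for robots.txt using only the standard library."""
    req = request.Request(parse.urljoin(url_base, 'robots.txt'), method='HEAD')
    try:
        with request.urlopen(req) as response:
            return response.status
    except error.HTTPError as e:
        # 4xx/5xx responses are raised as HTTPError; the status code is still available.
        return e.code

print(robots_txt_status('https://google.com/'))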

Related

difficulty reading online xml content with python, xml.etree.ElementTree and urllib

I am reading XML online from an RSS feed using Python, xml.etree.ElementTree and urllib. My code seems straightforward but is not giving me the results that I want: no matter what I do, it always returns what looks like all the data in the XML stream. I am open to better suggestions on how to read specific strings into lists. See my code below:
import xml.etree.ElementTree as ET
from urllib import request

title_list = []

def main():
    try:
        response = request.urlopen("https://www.abcdefghijkl.xml")
        rsp_code = response.code
        print(rsp_code)
        if rsp_code == 200:
            webdata = response.read()
            print("1")
            xml = webdata.decode('UTF-8')
            print("2")
            tree = ET.parse(xml)
            print("3")
            items = tree.findall('channel')
            print("4")
            for item in items:
                title = item.find('title').text
                title_list.append(title)
                print(f"title_list 0 is, {title_list}")
            print("5")
    except Exception as e:
        print(f'An error occurred {str(e)}')

main()
Thanks, everyone, I figured it out after an awesome Udemy video. I eventually used the bs4 (Beautiful Soup) library and requests. Here's the code below:
import bs4
import requests

title_list = []

def main():
    try:
        result = requests.get("https://abcdefghijk.xml")
        res_text = result.text
        soup = bs4.BeautifulSoup(res_text, features="xml")
        title_tag_list = soup.select('title')
        for titles in title_tag_list:
            title = titles.text
            title_list.append(title)
            print(f"title_list 0 is, {title_list}")
        print("5")
    except Exception as e:
        print(f'An error occurred {str(e)}')

main()
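For completeness, a minimal sketch of how the original ElementTree approach could also work, assuming a standard RSS layout (channel/item/title) and keeping the same placeholder feed URL:

import xml.etree.ElementTree as ET
from urllib import request

def fetch_titles(url):
    response = request.urlopen(url)
    if response.code == 200:
        # ET.parse() expects a file path or file object, so use
        # ET.fromstring() for XML that is already held in memory.
        root = ET.fromstring(response.read().decode('UTF-8'))
        # In RSS, <title> elements sit under <channel> and <channel>/<item>.
        return [t.text for t in root.iterfind('channel/item/title')]
    return []

print(fetch_titles("https://www.abcdefghijkl.xml"))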

Webscraper Returning Blank Results (IDE issue?)

I was assisted with the code below for a web scraper by one of the very helpful chaps on here; however, it has all of a sudden stopped returning results. It either returns a blank set() or nothing at all.
Does the below work for you? I need to know if it's an issue with my IDE, as it doesn't make any sense for it to be working one minute and then giving random results the next when no changes were made to the code.
from requests_html import HTMLSession
import requests

def get_source(url):
    try:
        session = HTMLSession()
        response = session.get(url)
        return response
    except requests.exceptions.RequestException as e:
        print(e)

def scrape_google(query, start):
    response = get_source(f"https://www.google.co.uk/search?q={query}&start={start}")
    links = list(response.html.absolute_links)
    google_domains = ('https://www.google.',
                      'https://google.',
                      'https://webcache.googleusercontent.',
                      'http://webcache.googleusercontent.',
                      'https://policies.google.',
                      'https://support.google.',
                      'https://maps.google.')
    for url in links[:]:
        if url.startswith(google_domains):
            links.remove(url)
    return links

data = []
for i in range(3):
    data.extend(scrape_google('best place', i * 10))
print(set(data))
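One way to narrow down whether the IDE or the response is at fault is to look at what Google actually returned before any filtering. A minimal diagnostic sketch, reusing the get_source helper above (the query string is just an example):

# Inspect the raw response before any domain filtering is applied.
response = get_source("https://www.google.co.uk/search?q=best+place&start=0")
print(response.status_code)               # may be 200 even for a consent/CAPTCHA page
print(len(response.html.absolute_links))  # 0 means the returned page itself carried no links
print(response.html.html[:500])           # peek at the raw HTML that came back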

Having trouble returning 404 responses with asyncio/aiohttp

import time
import asyncio
import aiohttp

async def is_name_available(s, name):
    async with s.get("https://twitter.com/%s" % name) as res:
        if res.raise_for_status == 404:
            print('%s is available!' % name)
            return name

async def check_all_names(names):
    async with aiohttp.ClientSession(raise_for_status=True) as s:
        tasks = []
        for name in names:
            task = asyncio.create_task(is_name_available(s, name))
            tasks.append(task)
        return await asyncio.gather(*tasks)

def main():
    with open('names.txt') as in_file, open('available.txt', 'w') as out_file:
        names = [name.strip() for name in in_file]
        start_time = time.time()
        results = asyncio.get_event_loop().run_until_complete(check_all_names(names))
        results = [i for i in results if i]
        out_file.write('\n'.join(results))
        print(f'[ <? ] Checked {len(names)} words in {round(time.time()-start_time, 2)} second(s)')

if __name__ == '__main__':
    main()
I cannot seem to figure out how to return only the 404'd links in is_name_available with this asyncio/aiohttp structure, which I'm reusing from another project of mine. I'm a beginner in Python and any help is appreciated.
This line is incorrect:
if res.raise_for_status == 404:
raise_for_status is a method, so you're supposed to call it, not compare it to a number (the comparison will always be False). And in your case, you don't want to call raise_for_status in the first place, because you don't want to raise an exception when encountering a 404; you want to detect it. To detect a 404, you can simply write:
if res.status == 404:
Also note that you don't want to specify raise_for_status=True, because it will raise an exception for a 404 before the if gets a chance to run.
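Putting those two changes together, a minimal sketch of the corrected coroutines could look like this (the rest of the script stays as it is):

import asyncio
import aiohttp

async def is_name_available(s, name):
    async with s.get("https://twitter.com/%s" % name) as res:
        # Check the status attribute instead of calling raise_for_status().
        if res.status == 404:
            print('%s is available!' % name)
            return name

async def check_all_names(names):
    # No raise_for_status=True here, so a 404 reaches the check above.
    async with aiohttp.ClientSession() as s:
        tasks = [asyncio.create_task(is_name_available(s, name)) for name in names]
        return await asyncio.gather(*tasks)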

Python urllib.request shows RemoteDisconnected error with OpenElevation API

I have an application (written in PyQt5) that returns the x, y, and elevation of a location. When the user fills in the x and y fields and hits the getz button, the app calls the function below:
def getz(self, i):
    """calculates the elevation"""
    import urllib
    url = "https://api.open-elevation.com/api/v1/lookup"
    x = self.lineEditX.text()
    y = self.lineEditY.text()
    url = url + "\?locations\={},{}".format(x, y)
    print(url)
    if i is "pushButtonSiteZ":
        response = urllib.request.Request(url)
        fp = urllib.request.urlopen(response)
        print('response is ' + response)
        self.lineEditSiteZ.setText(fp)
According to the Open Elevation guide, you have to make requests in the form of:
curl https://api.open-elevation.com/api/v1/lookup\?locations\=50.3354,10.4567
in order to get elevation data as a JSON object. But in my case it returns an error saying:
raise RemoteDisconnected("Remote end closed connection without"
RemoteDisconnected: Remote end closed connection without response
and nothing happens. How can I fix this?
There is no other way than to retry in a loop until the response is OK, because the Open Elevation API still has trouble handling many requests. The following piece of code works, though possibly after a long delay:
def getz(self, i):
    """calculates the elevation"""
    import json
    import requests
    url = "https://api.open-elevation.com/api/v1/lookup"
    if i == 'pushButtonSiteZ':
        x = self.lineEditSiteX.text()
        y = self.lineEditSiteY.text()
        param = url + '?locations={},{}'.format(x, y)
        print(param)
        while True:
            try:
                response = requests.get(param)
                print(response.status_code)
                if str(response.status_code) == '200':
                    r = response.text
                    r = json.loads(r)
                    out = r['results'][0]['elevation']
                    print(out)
                    self.lineEditSiteZ.setText(str(out))
                    cal_rng(self)
                    break
            except ConnectionError:
                continue
            except json.decoder.JSONDecodeError:
                continue
            except KeyboardInterrupt:
                continue
            except requests.exceptions.SSLError:
                continue
            except requests.exceptions.ConnectionError:
                continue
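A variation on the same retry idea, assuming requests is available, that lets requests build the query string and bounds the number of attempts instead of looping forever:

import requests

def get_elevation(x, y, retries=10):
    """Look up an elevation via Open Elevation, retrying a bounded number of times."""
    url = "https://api.open-elevation.com/api/v1/lookup"
    for _ in range(retries):
        try:
            # requests builds ?locations=x,y itself, so no manual escaping is needed.
            r = requests.get(url, params={"locations": "{},{}".format(x, y)}, timeout=30)
            if r.status_code == 200:
                return r.json()["results"][0]["elevation"]
        except requests.exceptions.RequestException:
            continue
    return None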

Scraping a forum: cannot scrape posts that have a table within them

I've almost finished writing my first scraper!
I've run into a snag, however: I can't seem to grab the contents of posts that contain a table (posts that cite another post, in other words).
This is the code that extracts post contents from the soup object. It works just fine:
def getPost_contents(soup0bj):
    try:
        soup0bj = (soup0bj)
        post_contents = []
        for content in soup0bj.findAll('', {'class' : 'post_content'}, recursive = 'True'):
            post_contents.append(content.text.strip())
    ...  # Error management
    return (post_contents)
Here's an example of what I need to scrape (highlighted in yellow):
Problem post
(URL, just in case: http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm#t657906)
How do I get the contents that I've highlighted? And why does my current getPost_contents function not work in this particular instance? As far as I can see, the strings are still under div class="post_content".
EDIT:
This is how I am getting my BeautifulSoup:
from bs4 import BeautifulSoup as Soup

def getHTMLsoup(url):
    try:
        html = urlopen(url)
    ...  # Error management
    try:
        soup0bj = Soup(html.read().decode('utf-8', 'replace'))
        time.sleep(5)
    ...  # Error management
    return (soup0bj)
EDIT 2:
These are the relevant bits of the scraper: (Sorry about the dump!)
from bs4 import BeautifulSoup as Soup
from urllib.request import urlopen, urlretrieve
from urllib.error import HTTPError, URLError
import time, re

def getHTMLsoup(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
        print('The server hosting{} is unavailable.'.format(url), '\n')
        print('Trying again in 10 minutes...', '\n')
        time.sleep(600)
        getHTMLsoup(url)
    except URLError as e:
        return None
        print('The webpage found at {} is unavailable.'.format(url), '\n')
        print('Trying again in 10 minutes...', '\n')
        time.sleep(600)
        getHTMLsoup(url)
    try:
        soup0bj = Soup(html.read().decode('utf-8', 'replace'))
        time.sleep(5)
    except AttributeError as e:
        return None
        print("Ooops, {}'s HTML structure wasn't detected.".format(url), '\n')
    return soup0bj
def getMessagetable(soup0bj):
    try:
        soup0bj = (soup0bj)
        messagetable = []
        for data in soup0bj.findAll('tr', {'class' : re.compile('message.*')}, recursive = 'True'):
            messagetable.append(data)
    except AttributeError as e:
        print(' ')
    return (messagetable)
def getTime_stamps(soup0bj):
    try:
        soup0bj = (soup0bj)
        time_stamps = []
        for stamp in soup0bj.findAll('span', {'class' : 'topic_posted'}):
            time_stamps.append(re.search('..\/..\/20..', stamp.text).group(0))
    except AttributeError as e:
        print('No time-stamps found. Moving on.', '\n')
    return (time_stamps)

def getHandles(soup0bj):
    try:
        soup0bj = (soup0bj)
        handles = []
        for handle in soup0bj.findAll('span', {'data-id_user' : re.compile('.*')}, limit = 1):
            handles.append(handle.text)
    except AttributeError as e:
        print("")
    return (handles)

def getPost_contents(soup0bj):
    try:
        soup0bj = (soup0bj)
        post_contents = []
        for content in soup0bj.findAll('div', {'class' : 'post_content'}, recursive = 'True'):
            post_contents.append(content.text.strip())
    except AttributeError as e:
        print('Ooops, something has gone wrong!')
    return (post_contents)

html = ('http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm')

for soup in getHTMLsoup(html):
    for messagetable in getMessagetable(soup):
        print(getTime_stamps(messagetable), '\n')
        print(getHandles(messagetable), '\n')
        print(getPost_contents(messagetable), '\n')
The problem is your decoding: it is not utf-8. If you remove the "replace", your code will error with:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 253835: invalid continuation byte
The data seems to be latin-1 encoded; decoding to latin-1 causes no errors, but the output does look off in certain parts. Using
html = urlopen(r).read().decode("latin-1")
will work but as I mentioned, you get weird output like:
"diabète en cas d'accident de la route ou malaise isolÊ ou autre ???"
Another option would be to pass an accept-charset header:
from urllib.request import Request, urlopen
headers = {"accept-charset":"utf-8"}
r = Request("http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm#t657906",headers=headers)
html = urlopen(r).read()
I get the exact same encoding issue using requests and letting it handle the encoding; it is as if the data has mixed encoding, some utf-8 and some latin-1. The headers returned from requests show the content encoding as gzip:
'Content-Encoding': 'gzip'
If we specify that we want gzip and decode it ourselves:
from urllib.request import Request, urlopen
headers = {"Accept-Encoding":"gzip"}
r = Request("http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm#t657906",headers=headers)
r = urlopen(r)
import gzip
gzipFile = gzip.GzipFile(fileobj=r)
print(gzipFile.read().decode("latin-1"))
We get the same errors with utf-8 and the same weird output decoding to latin-1. Interestingly, in Python 2 both requests and urllib work fine.
Using chardet:
r = urlopen(r)
import chardet
print(chardet.detect(r.read()))
reckons with around 71 percent confidence that it is ISO-8859-2 but that again gives the same bad output.
{'confidence': 0.711104254322944, 'encoding': 'ISO-8859-2'}
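As one further experiment, the raw bytes could be handed straight to BeautifulSoup, which runs its own encoding detection (Unicode, Dammit); given the mixed encoding described above, some characters may still look off:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://forum.doctissimo.fr/sante/diabete/savoir-diabetique-sujet_170840_1.htm"
raw = urlopen(url).read()
# BeautifulSoup accepts bytes and guesses the encoding itself.
soup = BeautifulSoup(raw, "html.parser")
print(soup.original_encoding)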
