I'm querying Facebook with the following code, iterating over a list of page names to get each page's numeric ID and store it in a dictionary. I keep hitting an HTTP 500 error, although it doesn't occur with the short list I present here. See the code:
import json
import urllib.request

def FB_IDs(page_name, access_token=access_token):
    """ get page's numeric information """
    # construct URL
    base = "https://graph.facebook.com/v2.4"
    node = "/" + str(page_name)
    parameters = "/?access_token=%s" % access_token
    url = base + node + parameters
    # retrieve data
    with urllib.request.urlopen(url) as url:
        data = json.loads(url.read().decode())
    return data

pages_ids_dict = {}
for page in pages:
    pages_ids_dict[page] = FB_IDs(page, access_token)['id']
How can I automate this and avoid the error?
There is a pretty standard helper function for this, which you might want to look at:
### HELPER FUNCTION ###
import time
import urllib.request

def request_until_succeed(url):
    """ helper function to catch HTTP error 500 """
    req = urllib.request.Request(url)
    success = False
    while success is False:
        try:
            response = urllib.request.urlopen(req)
            if response.getcode() == 200:
                success = True
        except Exception as e:
            print(e)
            time.sleep(5)
            print("Error for URL")
            # use the following instead if the URL and timestamp shall be printed (needs import datetime):
            # print("Error for URL %s: %s" % (url, datetime.datetime.now()))
    return response.read()
Integrate that helper into your function so that the URL is requested through it, and you should be fine.
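A minimal sketch of how the two could fit together, reusing the names from above (this is just one way to wire it up):

def FB_IDs(page_name, access_token=access_token):
    """ get page's numeric information, retrying until the request succeeds """
    url = "https://graph.facebook.com/v2.4/%s/?access_token=%s" % (page_name, access_token)
    # request_until_succeed() keeps retrying until it gets a 200 response
    data = json.loads(request_until_succeed(url).decode())
    return data

pages_ids_dict = {page: FB_IDs(page, access_token)['id'] for page in pages}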
I was trying to learn stock prediction with the help of this GitHub project, but when I run the main.py file given in the repository from the command line, I encounter an error:
File "/Stock-Predictor/src/tweetstream/streamclasses.py", line 101
except urllib2.HTTPError, exception:
^
SyntaxError: invalid syntax
The code below is part of a PyPI module named tweetstream, i.e. tweetstream/streamclasses.py, which gave this error while being used in a Twitter sentiment analysis project:
import time
import urllib
import urllib2
import socket
from platform import python_version_tuple
import anyjson

from . import AuthenticationError, ConnectionError, USER_AGENT


class BaseStream(object):
    """A network connection to Twitters streaming API

    :param username: Twitter username for the account accessing the API.
    :param password: Twitter password for the account accessing the API.
    :keyword count: Number of tweets from the past to get before switching to
      live stream.
    :keyword url: Endpoint URL for the object. Note: you should not
      need to edit this. It's present to make testing easier.

    .. attribute:: connected

        True if the object is currently connected to the stream.

    .. attribute:: url

        The URL to which the object is connected

    .. attribute:: starttime

        The timestamp, in seconds since the epoch, the object connected to the
        streaming api.

    .. attribute:: count

        The number of tweets that have been returned by the object.

    .. attribute:: rate

        The rate at which tweets have been returned from the object as a
        float. see also :attr: `rate_period`.

    .. attribute:: rate_period

        The amount of time to sample tweets to calculate tweet rate. By
        default 10 seconds. Changes to this attribute will not be reflected
        until the next time the rate is calculated. The rate of tweets vary
        with time of day etc. so it's useful to set this to something
        sensible.

    .. attribute:: user_agent

        User agent string that will be included in the request. NOTE: This can
        not be changed after the connection has been made. This property must
        thus be set before accessing the iterator. The default is set in
        :attr: `USER_AGENT`.
    """

    def __init__(self, username, password, catchup=None, url=None):
        self._conn = None
        self._rate_ts = None
        self._rate_cnt = 0
        self._username = username
        self._password = password
        self._catchup_count = catchup
        self._iter = self.__iter__()

        self.rate_period = 10  # in seconds
        self.connected = False
        self.starttime = None
        self.count = 0
        self.rate = 0
        self.user_agent = USER_AGENT
        if url: self.url = url

    def __enter__(self):
        return self

    def __exit__(self, *params):
        self.close()
        return False

    def _init_conn(self):
        """Open the connection to the twitter server"""
        headers = {'User-Agent': self.user_agent}

        postdata = self._get_post_data() or {}
        if self._catchup_count:
            postdata["count"] = self._catchup_count

        poststring = urllib.urlencode(postdata) if postdata else None
        req = urllib2.Request(self.url, poststring, headers)

        password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
        password_mgr.add_password(None, self.url, self._username, self._password)
        handler = urllib2.HTTPBasicAuthHandler(password_mgr)
        opener = urllib2.build_opener(handler)

        try:
            self._conn = opener.open(req)
        except urllib2.HTTPError, exception:  # ___________________________ problem here
            if exception.code == 401:
                raise AuthenticationError("Access denied")
            elif exception.code == 404:
                raise ConnectionError("URL not found: %s" % self.url)
            else:  # re raise. No idea what would cause this, so want to know
                raise
        except urllib2.URLError, exception:
            raise ConnectionError(exception.reason)
The second item in the except clause is an identifier that is used in the body of the clause to access the exception object. The try/except syntax changed between Python 2 and Python 3, and your code uses the Python 2 syntax.
Python 2 (language reference):
try:
    ...
except <expression>, <identifier>:
    ...
Python 3 (language reference, rationale):
try:
    ...
except <expression> as <identifier>:
    ...
Note that <expression> can be a single exception class or a tuple of exception classes, to catch more than one type in a single except clause. So, to answer your title question, you could use the following to handle more than one possible exception being thrown:
try:
    x = array[5]  # NameError if array doesn't exist, IndexError if it is too short
except (IndexError, NameError) as e:
    print(e)  # which was it?
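Applied to the snippet from the question, the failing clause would look like this in Python 3. Note that urllib2 itself no longer exists in Python 3 (its pieces live in urllib.request and urllib.error), so a full port of the module touches more than the except lines; this is only a sketch of the syntax fix:

import urllib.error  # Python 3 home of HTTPError / URLError

try:
    self._conn = opener.open(req)
except urllib.error.HTTPError as exception:
    if exception.code == 401:
        raise AuthenticationError("Access denied")
    elif exception.code == 404:
        raise ConnectionError("URL not found: %s" % self.url)
    else:
        raise
except urllib.error.URLError as exception:
    raise ConnectionError(exception.reason)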
Use...
try:
    # code here
except MyFirstError:
    # exception handling
except AnotherError:
    # exception handling
You can repeat the except clause as many times as you need.
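A minimal runnable illustration of that pattern (the function and values are made up purely for the demo):

def parse_port(value):
    try:
        return int(value)
    except TypeError:   # value was None or some non-numeric object
        return 8080
    except ValueError:  # value was a string that isn't a number
        return 8080

print(parse_port("80"))    # 80
print(parse_port(None))    # 8080
print(parse_port("http"))  # 8080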
I am scraping dictionary data from the https://www.dictionary.com/ website. The purpose is to remove the unwanted elements from the dictionary pages and save them offline for further processing. Because the web pages are somewhat unstructured, the elements mentioned in the code below may or may not be present; the absence of an element raises an exception (in snippet 2). And since in the actual code there are many elements to be removed, each of which may be present or absent, wrapping every such statement in its own try - except would increase the lines of code drastically.
Thus I am working on a workaround for this problem by creating a separate function for the try - except (in snippet 3), the idea for which I got from here. But I am unable to get the code in snippet 3 working, as a command such as soup.find_all('style') is returning None, whereas it should return the list of all the style tags, as in snippet 2. I cannot apply the referred solution directly because sometimes I have to reach the element to be removed indirectly, by referring to its parent or sibling, as in soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent
Snippet 1 is used to set the environment for code execution.
It would be great if you could provide some suggestion to get snippet 3 working.
Snippet 1 (Setting the environment for executing code):
import urllib.request
import requests
from bs4 import BeautifulSoup
import re

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',}
folder = "dictionary_com"
Snippet 2 (working):
def makedefinition(url):
    success = False
    while success == False:
        try:
            request = urllib.request.Request(url, headers=headers)
            final_url = urllib.request.urlopen(request, timeout=5).geturl()
            r = requests.get(final_url, headers=headers, timeout=5)
            success = True
        except:
            success = False

    soup = BeautifulSoup(r.text, 'lxml')
    soup = soup.find("section", {'class': 'css-1f2po4u e1hj943x0'})

    # there are many more elements to remove. mentioned only 2 for shortness
    remove = soup.find_all("style")  # style tags
    remove.extend(soup.find('h2', {'class': 'css-1iltn77 e17deyx90'}).parent)  # related content in the page

    for x in remove: x.decompose()

    return(soup)
# testing code on multiple urls
#url = "https://www.dictionary.com/browse/a"
#url = "https://www.dictionary.com/browse/a--christmas--carol"
#url = "https://www.dictionary.com/brdivowse/affection"
#url = "https://www.dictionary.com/browse/hot"
#url = "https://www.dictionary.com/browse/move--on"
url = "https://www.dictionary.com/browse/cuckold"
#url = "https://www.dictionary.com/browse/fear"
maggi = makedefinition(url)
with open(folder+"/demo.html", "w") as file:
    file.write(str(maggi))
Snippet 3 (not working):
soup = None

def safe_execute(command):
    global soup
    try:
        print(soup)           # correct soup is printed
        print(exec(command))  # this should print the list of style tags but prints None; for the related content it should throw some exception
        return exec(command)  # None is being returned for style
    except Exception:
        print(Exception.with_traceback())
        return []

def makedefinition(url):
    global soup
    success = False
    while success == False:
        try:
            request = urllib.request.Request(url, headers=headers)
            final_url = urllib.request.urlopen(request, timeout=5).geturl()
            r = requests.get(final_url, headers=headers, timeout=5)
            success = True
        except:
            success = False

    soup = BeautifulSoup(r.text, 'lxml')
    soup = soup.find("section", {'class': 'css-1f2po4u e1hj943x0'})

    # there are many more elements to remove. mentioned only 2 for shortness
    remove = safe_execute("soup.find_all('style')")  # style tags
    remove.extend(safe_execute("soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent"))  # related content in the page

    for x in remove: x.decompose()

    return(soup)
# testing code on multiple urls
#url = "https://www.dictionary.com/browse/a"
#url = "https://www.dictionary.com/browse/a--christmas--carol"
#url = "https://www.dictionary.com/brdivowse/affection"
#url = "https://www.dictionary.com/browse/hot"
#url = "https://www.dictionary.com/browse/move--on"
url = "https://www.dictionary.com/browse/cuckold"
#url = "https://www.dictionary.com/browse/fear"
maggi = makedefinition(url)
with open(folder+"/demo.html", "w") as file:
    file.write(str(maggi))
In your code in snippet 3 you use the exec built-in function, which returns None regardless of what it does with its argument. For details see this SO thread.
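A quick, purely illustrative check of that behaviour (run at module level, where exec writes into the module globals):

# exec() runs the statement(s) it is given and always returns None
print(exec("result = 1 + 1"))  # prints: None
print(result)                  # prints: 2, the assignment did happen as a side effect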
Remedy:
Use exec to assign to a variable inside a namespace and return that variable, instead of returning the output of exec itself.
def safe_execute(command):
    d = {}
    try:
        exec(command, d)
        return d['output']
    except Exception as e:
        print(e)
        return []
Then call it like this:
remove = safe_execute("output = soup.find_all('style')")
EDIT:
Upon execution of this code, None is again returned. Upon debugging, however, if we print(soup) inside the try section a correct soup value is printed, but exec(command, d) gives NameError: name 'soup' is not defined.
This disparity has been overcome by using eval() instead of exec(). The function as finally defined is:
def safe_execute(command):
    global soup
    try:
        output = eval(command)
        return(output)
    except Exception:
        return []
And the call looks like:
remove = safe_execute("soup.find_all('style')")
remove.extend(safe_execute("soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent"))
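As a variation (not part of the original answer), the global could be avoided entirely by handing soup to eval() as an explicit namespace:

def safe_execute(command, soup):
    """Evaluate a BeautifulSoup expression, returning [] if any element is missing."""
    try:
        # eval() sees only the names we pass in; 'soup' is supplied explicitly
        return eval(command, {"soup": soup})
    except Exception:
        return []

remove = safe_execute("soup.find_all('style')", soup)
remove.extend(safe_execute("soup.find('h2',{'class':'css-1iltn77 e17deyx90'}).parent", soup))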
https://developers.google.com/youtube/v3/code_samples/apps-script#subscribe_to_channel
Hello,
I can't figure out how to subscribe to a YouTube channel with a POST request. I'm not looking to use YoutubeSubscriptions as shown above. I'm simply looking to pass an API key, but can't seem to figure it out. Any suggestions?
If you don't want to use YoutubeSubscriptions, you have to get the session_token after logging in to your YouTube account.
The session_token is stored in the hidden input tag:
document.querySelector('input[name=session_token]').value
or do a full-text search for the XSRF_TOKEN field; the corresponding value is the session_token. A reference regular expression:
const regex = /\'XSRF_TOKEN\':(.*?)\"(.*?)\"/g
Below is an implementation in Python:
import re
import json

def YouTubeSubscribe(url, SessionManager):
    while True:
        try:
            html = SessionManager.get(url).text  # .text gives a str (needed for re.findall with a str pattern)
            session_token = (re.findall(r"XSRF_TOKEN\W*(.*)=", html, re.IGNORECASE)[0]).split('"')[0]
            id_yt = url.replace("https://www.youtube.com/channel/", "")
            params = (('name', 'subscribeEndpoint'),)
            data = [
                ('sej', '{"clickTrackingParams":"","commandMetadata":{"webCommandMetadata":{"url":"/service_ajax","sendPost":true}},"subscribeEndpoint":{"channelIds":["'+id_yt+'"],"params":"EgIIAg%3D%3D"}}'),
                ('session_token', session_token+"=="),
            ]
            response = SessionManager.post('https://www.youtube.com/service_ajax', params=params, data=data)
            check_state = json.loads(response.content)['code']
            if check_state == "SUCCESS":
                return 1
            else:
                return 0
        except Exception as e:
            print("[E] YouTubeSubscribe: " + str(e))
            pass
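For context, SessionManager here is assumed to be a logged-in requests.Session; the cookie values and channel ID below are hypothetical placeholders that would come from an authenticated browser session:

import requests

SessionManager = requests.Session()
# hypothetical: cookies copied from a browser session that is logged in to YouTube
SessionManager.cookies.update({"SID": "...", "HSID": "...", "SSID": "..."})

result = YouTubeSubscribe("https://www.youtube.com/channel/<CHANNEL_ID>", SessionManager)
print("subscribed" if result else "failed")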
Hi, I am using the following function to parse a list of img reference urls. The reference urls are in the format "https://......strings/"; the parsed urls are in the format "http://..../img.jpg".
Input: a text file, each line is a reference url
Expected output: a list of parsed urls.
Parsing Rules:
For each input url, requests parses it to html. html.url will return either 1) the original url or 2) a successfully parsed url. If the original url is returned, use BeautifulSoup to extract the img url inside the html. If somehow the extraction fails, fall back to the original url. Then try the whole process again until the maximum try_num is reached or it succeeds.
Actual behavior:
success: http://ww3.sinaimg.cn/large/005ZD4gwgw1fa6u080sdlj30ku112n4b.jpg
success: http://ww1.sinaimg.cn/large/005ZD4gwgw1fa6tzfpvkwj305t058t8o.jpg
success: https://weibo.cn/mblog/oripic?id=EjvpZdb4y&u=005ZD4gwgw1fa6tzvk95tg30n20cqqva&rl=1
As you can see, even with unsuccessful parsing (the 3rd), the function outputs "success" without any of the error messages I put in. I have tried parsing those reference urls manually and tested html.url == imgref; they are fine. So it somehow only happens when I feed a list of urls to the function. I am really not sure where it went wrong. It seems none of the try..except blocks caught any exception.
Apologies in advance if I have made some very stupid mistakes, and for the ambiguous title; I really don't know where the problem is, so I couldn't find a good title.
Thank you all.
import re
import time
import requests
from bs4 import BeautifulSoup

def decode_imgurl(imgref, cookie):
    connection_timeout = 90
    try_num = 1
    imgurl = imgref
    start_time = time.time()
    while imgurl == imgref:
        while True:
            try:
                html = requests.get(imgref, cookies=cookie, timeout=60)
                break
            except Exception:
                if time.time() > start_time + connection_timeout:
                    raise Exception('Unable to get connection %s after %s seconds of ConnectionErrors'
                                    % (imgref, connection_timeout))
                else:
                    time.sleep(1)

        if html.url == imgref:
            soup = BeautifulSoup(html.content, "lxml")
            try:
                imgurl = soup.findAll("a", href=re.compile(r'^http://', re.I))[0]['href']
            except IndexError:  # happens with continuous requests in short time
                print("requests too fast")
                imgurl = imgref
                time.sleep(1)
        else:
            imgurl = html.url

        try_num += 1
        if try_num > 10:
            print("fail to parse. Please parse manually")
            return ""

    print("success: %s" % imgurl)
    return imgurl
from config import cookie  # import cookie file

with open(inputfile, 'r') as f:
    for line in f:
        if line.strip():
            parsed_url = decode_imgurl(line.strip(), cookie)
            print(parsed_url)
I'm trying to extract emails from web pages; here is my email grabber function:
import re
import requests
import bs4 as bs

def emlgrb(x):
    email_set = set()
    for url in x:
        try:
            response = requests.get(url)
            soup = bs.BeautifulSoup(response.text, "lxml")
            emails = set(re.findall(r"[a-z0-9\.\-+_]+#[a-z0-9\.\-+_]+\.[a-z]+", soup.text, re.I))
            email_set.update(emails)
        except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
            continue
    return email_set
This function should be fed by another function that creates a list of urls. The feeder function:
def handle_local_links(url, link):
    if link.startswith("/"):
        return "".join([url, link])
    return link

def get_links(url):
    try:
        response = requests.get(url, timeout=5)
        soup = bs.BeautifulSoup(response.text, "lxml")
        body = soup.body
        links = [link.get("href") for link in body.find_all("a")]
        links = [handle_local_links(url, link) for link in links]
        links = [str(link.encode("ascii")) for link in links]
        return links
    except Exception:  # the real code has many specific except clauses; each returns an empty list
        return []
The function continues with many except clauses which, if triggered, return an empty list (not important here). However, the return value from get_links() looks like this:
["b'https://pythonprogramming.net/parsememcparseface//'"]
Of course there are many links in the list (I cannot post them all due to reputation). The emlgrb() function is not able to process the list (InvalidSchema: No connection adapters were found). However, if I manually remove the b and the redundant quotes, so that the list looks like this:
['https://pythonprogramming.net/parsememcparseface//']
emlgrb() works. Any suggestion as to where the problem is, or how to create a "cleaning function" that turns the first list into the second, is welcome.
Thanks
The solution is to drop .encode('ascii'):
def get_links(url):
    try:
        response = requests.get(url, timeout=5)
        soup = bs.BeautifulSoup(response.text, "lxml")
        body = soup.body
        links = [link.get("href") for link in body.find_all("a")]
        links = [handle_local_links(url, link) for link in links]
        links = [str(link) for link in links]
        return links
    except Exception:  # as before, the real code catches several specific exceptions
        return []
You can also pass an encoding to str(), as in the docs: str(object=b'', encoding='utf-8', errors='strict').
That's because str() without an encoding calls .__repr__() or .__str__() on the object, so if it is a bytes object the output is "b'string'". That is exactly what gets printed when you do print(bytes_obj). And calling .encode() on a str object creates a bytes object!
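A short illustration of the behaviour described above (the URL is just an example value taken from the list):

b = "https://pythonprogramming.net/parsememcparseface//".encode("ascii")  # bytes object

print(str(b))                    # "b'https://pythonprogramming.net/parsememcparseface//'" (repr-style, keeps the b'' wrapper)
print(str(b, encoding="utf-8"))  # "https://pythonprogramming.net/parsememcparseface//" (decoded back to str)
print(b.decode("ascii"))         # same as the line above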