I am working with twint to download some twitter followers. Every now and then, twint will throw an error when it cannot find the "more" button. This is described here: https://github.com/twintproject/twint/issues/340.
My workaround is to loop through until I've gotten the number of followers I'd like. However, I would like to keep track of the number of times that the "more" button was not found using cons_errors. Looking at feed.py, https://github.com/twintproject/twint/blob/master/twint/feed.py, an IndexError is raised when this button is not found. How can I track when this happens?
This is the code from within feed.py that raises the error:
def Follow(response):
    logme.debug(__name__ + ':Follow')
    soup = BeautifulSoup(response, "html.parser")
    follow = soup.find_all("td", "info fifty screenname")
    cursor = soup.find_all("div", "w-button-more")
    try:
        cursor = findall(r'cursor=(.*?)">', str(cursor))[0]
    except IndexError:
        logme.critical(__name__ + ':Follow:IndexError')
    return follow, cursor
The code I am using is as follows, but it never catches the exception; rather, twint.run.Followers(c) prints this error message to the console: CRITICAL:root:twint.feed:Follow:IndexError, and the loop continues iterating without ever printing in the except block or incrementing cons_errors. In other words, the try always succeeds.
import twint
import time

c = twint.Config()
c.Limit = 1000                 # download 1,000 followers
c.Username = "MarketGoldberg"  # random account chosen with 79 followers
c.Output = "followers.txt"     # where to save followers
c.Resume = "resume.txt"        # user to pick up from if called again

download_rounds = 2  # intentionally high to force the error; all 79 followers will be downloaded in the first iteration
cons_errors = 0      # number of consecutive errors received from twint

while download_rounds > 0 and cons_errors <= 10:
    try:
        twint.run.Followers(c)
        cons_errors = 0
        download_rounds -= 1
    except IndexError as err:
        print("in except block")
        cons_errors += 1
        time.sleep(5)
Since the exception is already caught inside the package, you can't catch it a second time. You could modify the package, remove the exception handling, and handle it yourself. Or you could add a logging handler that counts log messages ending with :Follow:IndexError.
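A minimal sketch of that second option (the handler class is my own; attaching it to the root logger is an assumption based on the CRITICAL:root: prefix in your console output):

import logging

class FollowErrorCounter(logging.Handler):
    # counts log records whose message ends with ':Follow:IndexError'
    def __init__(self):
        super().__init__(level=logging.CRITICAL)
        self.count = 0

    def emit(self, record):
        if record.getMessage().endswith(':Follow:IndexError'):
            self.count += 1

counter = FollowErrorCounter()
logging.getLogger().addHandler(counter)

In the loop, you could then compare counter.count before and after each twint.run.Followers(c) call and increment cons_errors only when it grew.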
Related
I'm trying to understand if it's possible to put a loop inside a try/except block, or if I'd need to restructure the code to use functions. Long story short, after spending a few hours learning Python and BeautifulSoup, I managed to Frankenstein some code together to scrape a list of URLs, pull that data out to CSV (and now update it into a MySQL db). The code is now working as planned, except that I occasionally run into a 10054 error, either because my VPN hiccups, or possibly because the source host server is occasionally bouncing me (I have a 30-second delay in my loop but it still kicks me out on occasion).
I get the general idea of the try/except structure, but I'm not quite sure how I would (or if I could) loop inside it to try again. My base code to grab the URL, clean it, and parse the table I need looks like this:
for url in contents:
    print('Processing record', (num + 1), 'of', len(contents))
    if url:
        print('Retrieving data from ', url[0])
        html = requests.get(url[0]).text
        soup = BeautifulSoup(html, 'html.parser')
        for span in soup('span'):
            span.decompose()
        trs = soup.select('div#collapseOne tr')
        if trs:
            print('Processing')
            for t in trs:
                for header, value in zip(t.select('td')[0], t.select('td:nth-child(2)')):
                    if num == 0:
                        headers.append(' '.join(header.split()))
                    values.append(re.sub(' +', ' ', value.get_text(' ', strip=True)))
After that, it's just processing the data to CSV and running an update SQL statement.
What I'd like to do: if the HTML request fails, wait 30 seconds and try the request again, then process; or, if the retry fails X number of times, go ahead and exit the script (assuming at that point I have a full connection failure).
Is it possible to do something like that inline, or would I need to make the request statement into a function and set up a loop to call it? I have to admit I'm not familiar with how Python works with function returns yet.
You can add an inner loop for the retries and put your try/except block in that. Here is a sketch of what it would look like. You could put all of this into a function and put that function call in its own try/except block to catch other errors that cause the loop to exit.
Looking at the requests exception hierarchy, Timeout covers multiple recoverable exceptions and is a good starting point for what you may want to catch. Other things like SSLError aren't going to get better just because you retry, so skip them. You can go through the list to see what is reasonable for you.
import itertools
import time

import requests

# requests exceptions at
# https://requests.readthedocs.io/en/master/_modules/requests/exceptions/

for url in contents:
    print('Processing record', (num + 1), 'of', len(contents))
    if url:
        print('Retrieving data from ', url[0])
        retry_count = itertools.count()
        # loop for retries
        while True:
            try:
                # get with timeout and convert http errors to exceptions
                resp = requests.get(url[0], timeout=10)
                resp.raise_for_status()
            # the things you want to recover from
            except requests.Timeout as e:
                if next(retry_count) <= 5:
                    print("timeout, wait and retry:", e)
                    time.sleep(30)
                    continue
                else:
                    print("timeout, exiting")
                    raise  # reraise exception to exit
            except Exception as e:
                print("unrecoverable error", e)
                raise
            break
        html = resp.text
        # etc…
I've put together a small example to illustrate this, and yes, you can put loops inside try/except blocks.
from sys import exit

def example_func():
    try:
        while True:
            num = input("> ")
            try:
                int(num)
                if num == "10":
                    print("Let's go!")
                else:
                    print("Not 10")
            except ValueError:
                exit(0)
    except:
        exit(0)

example_func()
This is a fairly simple program that takes input; if it's 10, it says "Let's go!", otherwise it tells you it's not 10 (and if it's not a valid integer, it just exits).
Notice that inside the while loop I put a try/except block, taking into account the necessary indentation. You can take this program as a model and use it in your favor.
First, I had never used selenium until yesterday. After many attempts, I was able to scrape the target table correctly.
I am currently trying to scrape the tables on sequential pages. It works sometimes, and other times it fails immediately. I have spent hours searching Google and Stack Overflow, but I have not solved my problem. I am sure the answer is something simple, but after 8 hours I need to ask a question of the experts in selenium.
My target URL is: RedHat Security Advisories
If there is a question on Stack Overflow that answers my problem, please let me know and I will do some more research and testing.
Here are some of the items that I have tried:
Example 1:
page_number = 0
while True:
    try:
        page_number += 1
        browser.execute_script(
            "return arguments[0].scrollIntoView(true);",
            WebDriverWait(browser, 30).until(EC.element_to_be_clickable(
                (By.XPATH, '//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[str(page_number))]'))))
        browser.find_element_by_xpath('//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[str(page_number)').click()
        print(f"Navigating to page {page_number}")
        # I added this because my connection was
        # being terminated by RedHat
        time.sleep(20)
    except (TimeoutException, WebDriverException) as e:
        print("Last page reached")
        break
    except Exception as e:
        print(e)
        break
Example 2:
page_number = 0
while True:
    try:
        page_number += 1
        browser.execute_script(
            "return arguments[0].scrollIntoView(true);",
            WebDriverWait(browser, 30).until(EC.element_to_be_clickable(
                (By.XPATH, '//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[12]'))))
        browser.find_element_by_xpath('//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[12]').click()
        print(f"Navigating to page {page_number}")
        # I added this because my connection was
        # being terminated by RedHat
        time.sleep(20)
    except (TimeoutException, WebDriverException) as e:
        print("Last page reached")
        break
    except Exception as e:
        print(e)
        break
You can use the below logic.
lastPage = WebDriverWait(driver, 120).until(EC.element_to_be_clickable((By.XPATH, "(//ul[starts-with(@class,'pagination hidden-xs ng-scope')]/li[starts-with(@ng-repeat,'pageNumber')])[last()]")))
driver.find_element_by_css_selector("i.web-icon-plus").click()
pages = lastPage.text
pages = '5'  # overrides the value above, presumably left in for testing
for pNumber in range(1, int(pages)):
    currentPage = WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, "//ul[starts-with(@class,'pagination hidden-xs ng-scope')]//a[.='" + str(pNumber) + "']")))
    print("===============================================")
    print("Current Page : " + currentPage.text)
    currentPage.location_once_scrolled_into_view
    currentPage.click()
    WebDriverWait(driver, 120).until_not(EC.element_to_be_clickable((By.CSS_SELECTOR, "#loading")))
    # print rows data here
    rows = driver.find_elements_by_xpath("//table[starts-with(@class,'cve-table')]/tbody/tr")  # getting rows here
    for row in rows:
        print(row.text)  # printing all row data; if you want cell data, update the logic accordingly
    time.sleep(randint(1, 5))  # this step is optional
I believe you can read the data directly using the URL instead of relying on pagination clicks; this will lead to fewer sync issues of the kind that may be making the script fail. The steps are below, with a sketch after the list:
1. Use this xpath to get the total number of pages for the security-updates table:
//*[@id="jumpPoint"]/div[3]/div/div/div[2]/div/div[2]/dir-pagination-controls/ul/li[11]
2. Run a loop up to the page count from step 1.
3. Inside the loop, pass the page number into the URL below and send a GET request:
https://access.redhat.com/security/security-updates/#/security-advisories?q=&p=page_number&sort=portal_publication_date%20desc&rows=10&portal_advisory_type=Security%20Advisory&documentKind=PortalProduct
4. Wait for the page to load.
5. Read the data from the table populated on the page.
6. This process will run up to the pagination count.
7. In case you find a specific error indicating that the site has blocked the user, you can refresh the page with the same page_number.
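A minimal sketch of that approach, reusing the table xpath from the answer above; the hardcoded page total is an assumption for illustration (in practice you would read it from the li[11] element in step 1):

import time

BASE_URL = ("https://access.redhat.com/security/security-updates/#/security-advisories"
            "?q=&p={page}&sort=portal_publication_date%20desc&rows=10"
            "&portal_advisory_type=Security%20Advisory&documentKind=PortalProduct")

total_pages = 5  # assumption: read this from the li[11] element instead
for page_number in range(1, total_pages + 1):
    browser.get(BASE_URL.format(page=page_number))  # step 3: request the page directly
    time.sleep(20)  # step 4: crude wait for the Angular table to render; a WebDriverWait would be cleaner
    rows = browser.find_elements_by_xpath("//table[starts-with(@class,'cve-table')]/tbody/tr")  # step 5
    for row in rows:
        print(row.text)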
I was collecting users' tweets using TwitterAPI when I stumbled upon this error.
Since I'm planning to crawl at least 500 tweets with different attributes, and each query returns at most 100 tweets, I made a function.
!pip install TwitterAPI
from TwitterAPI import TwitterAPI
import json
CONSUMER_KEY = ""        # ENTER YOUR CONSUMER_KEY
CONSUMER_SECRET = ""     # ENTER YOUR CONSUMER_SECRET
OAUTH_TOKEN = ""         # ENTER YOUR OAUTH_TOKEN
OAUTH_TOKEN_SECRET = ""  # ENTER YOUR OAUTH_TOKEN_SECRET
api = TwitterAPI(CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
Here's how my function goes:
def retrieve_tweets(api, keyword, batch_count, total_count):
    tweets = []
    batch_count = str(batch_count)
    resp = api.request('search/tweets', {'q': keyword,
                                         'count': batch_count,
                                         'lang': 'en',
                                         'result_type': 'recent',
                                         })
    # store the tweets in the list
    tweets += resp.json()['statuses']
    # find the max_id_str for the next batch
    ids = [tweet['id'] for tweet in tweets]
    max_id_str = str(min(ids))
    # loop until as many tweets as total_count are collected
    number_of_tweets = len(tweets)
    while number_of_tweets < total_count:
        print("{} tweets are collected for keyword {}. Last tweet created at {}".format(
            number_of_tweets, keyword, tweets[number_of_tweets - 1]['created_at']))
        resp = api.request('search/tweets', {'q': keyword,  # INSERT YOUR CODE
                                             'count': batch_count,
                                             'lang': 'en',
                                             'result_type': 'recent',
                                             'max_id': max_id_str
                                             })
        tweets += resp.json()['statuses']
        ids = [tweet['id'] for tweet in tweets]
        max_id_str = str(min(ids))
        number_of_tweets = len(tweets)
    print("{} tweets are collected for keyword {}. Last tweet created at {}".format(
        number_of_tweets, keyword, tweets[number_of_tweets - 1]['created_at']))
    return tweets
After that, I ran the function as follows:
first_group = retrieve_tweets(api, 'Rock', 100, 500)
It kept running fine until around the 180th tweet, then this popped up:
179 tweets are collected for keyword Rock. Last tweet created at Mon Apr 29 02:04:05 +0000 2019
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-9-cbeb6ede7a5a> in <module>
8 # Your function call should look like this: retrieve_tweets(api,'keyword',single_count,total_count)
9
---> 10 k1_tweets = retrieve_tweets(api, 'Restaurant', 100, 500) #INSERT YOUR CODE HERE
11
12
<ipython-input-7-0d0c87e7c3e9> in retrieve_tweets(api, keyword, batch_count, total_count)
55 )
56
---> 57 tweets += resp.json()['statuses']
58 ids = [tweet['id'] for tweet in tweets]
59 max_id_str = str(min(ids))
KeyError: 'statuses'
It should have run smoothly all the way to 500, and I had tested the key 'statuses' multiple times before.
Additionally, this happens randomly at different points in the collecting phase; there was a time when I managed to finish my first group of 500 tweets, but then the error popped up while collecting the second group.
Also, when this error pops up, I can't use the key 'statuses' anymore until I shut down my editor and run everything all over again.
Here's the simple test that I always run before and after the error occurs:
a = api.request('search/tweets', {'q': 'Fun', 'count':'10'})
a1 = a.json()
a1['statuses']
You can use dict.get to get the value for the key statuses; it returns a default (None if not specified) when the key is not present, and otherwise the value for the key statuses:
tweets += resp.json().get('statuses', [])
if tweets:
    ids = [tweet['id'] for tweet in tweets]
    max_id_str = str(min(ids))
    number_of_tweets = len(tweets)
The JSON response from Twitter will not always contain a statuses key. You also need to handle responses that contain an errors key. Error responses are documented here: https://developer.twitter.com/en/docs/ads/general/guides/response-codes.html
Also, your code uses resp.json() to get this JSON structure. This is fine, but you can also use the iterator that comes with TwitterAPI. The iterator will iterate over the items contained in either statuses or errors. Here is the usage:
resp = api.request('search/tweets', {'q': 'pizza'})
for item in resp.get_iterator():
    if 'text' in item:
        print(item['text'])
    elif 'message' in item:
        print('%s (%d)' % (item['message'], item['code']))
One more thing you may not be aware of: TwitterAPI comes with a utility class, TwitterPager, that will make successive requests and keep track of max_id for you. Here's a short example: https://github.com/geduldig/TwitterAPI/blob/master/examples/page_tweets.py
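Applied to the question's use case, a minimal sketch of that utility class might look like this (the search parameters mirror the question; the early break at 500 tweets is my addition):

from TwitterAPI import TwitterAPI, TwitterPager

api = TwitterAPI(CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

# TwitterPager re-issues the request, adjusting max_id between calls,
# so each iteration yields progressively older tweets.
pager = TwitterPager(api, 'search/tweets',
                     {'q': 'Rock', 'count': 100, 'lang': 'en', 'result_type': 'recent'})

tweets = []
for item in pager.get_iterator():
    if 'text' in item:
        tweets.append(item)
        if len(tweets) >= 500:
            break
    elif 'message' in item:
        print('%s (%d)' % (item['message'], item['code']))  # e.g. a rate limit response
        break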
I created a loop (while True) to automate a task on a site with Python. This code clicks on two fields until an element appears on the page
(browser.find_element_by_id('formComp:buttonRetornar')).
When this element appears, I want the loop to stop and move on to the next block of code.
I tested it this way, but it raised an error: Python reported that the element 'formComp:buttonRetornar' was not found. But that's exactly the point: if it is not found, the loop should continue:
while (browser.find_element_by_id('formComp:repeatCompromissoLista:0:tableRealizacao:0:subtableVinculacoes:0:vinculacao_input')):
    vinc = wait.until(EC.presence_of_element_located((By.ID, 'formComp:repeatCompromissoLista:0:tableRealizacao:0:subtableVinculacoes:0:vinculacao_input')))
    vinc = browser.find_element_by_id('formComp:repeatCompromissoLista:0:tableRealizacao:0:subtableVinculacoes:0:vinculacao_input')
    vinc.send_keys('400')
    enterElem5 = wait.until(EC.element_to_be_clickable((By.ID, 'formComp:buttonConfirmar')))
    enterElem5 = browser.find_element_by_id('formComp:buttonConfirmar')
    enterElem5.send_keys(Keys.ENTER)
    time.sleep(int(segundosv))
    if (browser.find_element_by_id('formComp:buttonRetornar') == True):
        break
    else:
        continue
Try it like this; hope this helps. Check whether the element count is greater than 0:
if (len(browser.find_elements_by_id('formComp:buttonRetornar')) > 0):
    break
else:
    continue
find_element_by_id() does not return False when an element is not found. Instead, it raises selenium.common.exceptions.NoSuchElementException. You can handle the exception to get the flow control you are looking for:
from selenium.common.exceptions import NoSuchElementException

try:
    browser.find_element_by_id('formComp:buttonRetornar')
    break
except NoSuchElementException:
    continue
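Embedded in the question's while loop, that could look like the following sketch (the loop body is otherwise unchanged from the question):

while True:
    vinc = wait.until(EC.presence_of_element_located((By.ID, 'formComp:repeatCompromissoLista:0:tableRealizacao:0:subtableVinculacoes:0:vinculacao_input')))
    vinc.send_keys('400')
    enterElem5 = wait.until(EC.element_to_be_clickable((By.ID, 'formComp:buttonConfirmar')))
    enterElem5.send_keys(Keys.ENTER)
    time.sleep(int(segundosv))
    try:
        browser.find_element_by_id('formComp:buttonRetornar')
        break  # the button appeared, leave the loop
    except NoSuchElementException:
        continue  # not found yet, run the loop body again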
I've implemented a memory game in which the user has to sort numbers in their head while a 5-second timer is running.
Please see code below:
from random import randint
from threading import Timer

def timeout():
    print("Time over\n#####################\n")

while True:
    list = []
    for i in range(5):
        list.append(randint(1000, 10000))
    t = Timer(5, timeout)
    t.start()
    print(list)
    print('[ 1 , 2 , 3 , 4 , 5 ]')
    solution = sorted(list)[2]
    print('please select the 3rd largest number in this row (1-5):')
    input_string = input()
    selection = int(input_string)
    if solution == list[selection - 1]:
        print("Your guess is correct\n")
    else:
        print("Your guess is wrong\n")
    t.join()
Here is the game interaction itself (please ignore the syntax highlighting):
USER#HOST:~/$ python3 memory_game.py
[8902, 8655, 1680, 6763, 4489]
[ 1 , 2 , 3 , 4 , 5 ]
please select the 3rd largest number in this row (1-5):
4
Your guess is correct
Time over
#####################
[5635, 3810, 1114, 5042, 1481]
[ 1 , 2 , 3 , 4 , 5 ]
please select the 3rd largest number in this row (1-5):
4
Your guess is wrong
Time over
#####################
[6111, 1430, 7999, 3352, 2742]
[ 1 , 2 , 3 , 4 , 5 ]
please select the 3rd largest number in this row (1-5):
23
Traceback (most recent call last):
File "memory_game.py", line 24, in <module>
if solution == list[selection - 1]:
IndexError: list index out of range
Time over
#####################
Can anybody help me with these things:
1. 'Time over' should only be written if the player needs more than 5 sec for the answer. If the player solves it in time the next challenge should appear silently.
2. If the player does not enter any guess and just presses 'Enter', the program terminates with an error message:
Traceback (most recent call last):
File "memory_game.py", line 22, in <module>
selection = int(input_string)
ValueError: invalid literal for int() with base 10: ''
3. If the player enters an out-of-range number, the program quits with an 'index out of range' error. I couldn't find out where to put the try/except.
Any help would be appreciated - Thanks!
As for your questions:
1. You can accomplish that with t.cancel() (stop the timer and do not call the function) instead of t.join() (wait until the thread has finished; this will ALWAYS result in a timeout, of course).
2. (and 3.) These are basically the same -- put the query message and all input handling into a while loop, and break out of it once you know that the input is valid.
...
As an extra: the "time over" message wasn't really doing anything useful (e.g. you could still enter a valid answer after the timeout occurred). I "fixed" that in a brute-force way by making the program exit if the timeout is hit. Instead of doing that, you can also use a global variable to store whether the timeout was hit and handle that in your input handling accordingly (make sure to make it threadsafe, e.g. using a mutex).
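A rough sketch of that flag-based variant, using threading.Event (which is threadsafe by itself, so no explicit mutex is needed):

import threading

timed_out = threading.Event()

def timeout():
    print("Time over\n#####################\n")
    timed_out.set()  # signal the main thread instead of killing the process

# ...then, in the input handling, after reading a guess:
# if timed_out.is_set():
#     print("Too late, this answer doesn't count")
#     # start the next round instead of checking the guess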
In general, it might be easier to turn around the structure of the program -- let the main thread handle the timeout and verification of the input, let the thread only handle the input (this way, it's easy to kill the thread to stop the input from being handled).
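One possible shape of that inversion, sketched with a daemon input thread while the main thread enforces the limit (the blocked input() thread simply dies with the program because it is a daemon):

import threading

result = {}

def read_input():
    result['guess'] = input('please select the 3rd largest number in this row (1-5): ')

reader = threading.Thread(target=read_input, daemon=True)
reader.start()
reader.join(timeout=5)  # the main thread waits at most 5 seconds
if reader.is_alive():
    print("Time over\n#####################\n")
else:
    print("You answered:", result['guess'])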
Of course, using the select module, one could implement this even nicer without threads (have one pipe that gets written to by a thread/timer, and the standard input, then select both for reading and it will block until either user input or the timeout occurs).
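On Unix, the select idea can even skip the pipe if you use select's own timeout argument on standard input; a minimal sketch (this does not work on Windows, where select only accepts sockets):

import select
import sys

print('please select the 3rd largest number in this row (1-5):')
ready, _, _ = select.select([sys.stdin], [], [], 5)  # block for at most 5 seconds
if ready:
    input_string = sys.stdin.readline().strip()
    print('You entered:', input_string)
else:
    print("Time over\n#####################\n")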
And maybe someone will post a nice asyncio-based solution? ;)
Here's the modified solution (modifying only as little as possible to get it to work; one could refactor other parts to make it nicer in general):
import random
import threading
import os

def timeout():
    print("Time over\n#####################\n")
    # You can't use sys.exit() here, as this code is running in a thread
    # See also: https://stackoverflow.com/a/905224/1047040
    os._exit(0)

while True:
    list = []
    for i in range(5):
        list.append(random.randint(1000, 10000))
    t = threading.Timer(5, timeout)
    t.start()
    print(list)
    print('[ 1 , 2 , 3 , 4 , 5 ]')
    solution = sorted(list)[2]
    while True:
        try:
            print('please select the 3rd largest number in this row (1-5):')
            input_string = input()
            selection = int(input_string)
            chosen = list[selection - 1]
            # If we arrive here, the input was valid, break out of the loop
            break
        except Exception as e:
            # Tell the user that the input is wrong; feel free to remove "e"
            # from the print output
            print('Invalid input:', e)
    if solution == chosen:
        print("Your guess is correct\n")
    else:
        print("Your guess is wrong\n")
    # Make sure to cancel the thread, otherwise guessing correct or wrong
    # will block the CLI interface and still write "time out" at the end
    t.cancel()