How to fix KeyError 'statuses' while collecting tweets? - python-3.x

I was collecting users' tweets using TwitterAPI when I stumbled upon this error.
Since I'm planning to crawl at least 500 tweets with different attributes, and each query returns at most 100 tweets, I wrote a function.
!pip install TwitterAPI
from TwitterAPI import TwitterAPI
import json
CONSUMER_KEY = #ENTER YOUR CONSUMER_KEY
CONSUMER_SECRET = #ENTER YOUR CONSUMER_SECRET
OAUTH_TOKEN = #ENTER YOUR OAUTH_TOKEN
OAUTH_TOKEN_SECRET = #ENTER YOUR OAUTH_TOKEN_SECRET
api = TwitterAPI(CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
Here's how my function goes:
def retrieve_tweets(api, keyword, batch_count, total_count):
    tweets = []
    batch_count = str(batch_count)
    resp = api.request('search/tweets', {'q': 'keyword',
                                         'count': 'batch_count',
                                         'lang': 'en',
                                         'result_type': 'recent',
                                         }
                       )
    # store the tweets in the list
    tweets += resp.json()['statuses']
    # find the max_id_str for the next batch
    ids = [tweet['id'] for tweet in tweets]
    max_id_str = str(min(ids))
    # loop until as many tweets as total_count is collected
    number_of_tweets = len(tweets)
    while number_of_tweets < total_count:
        print("{} tweets are collected for keyword {}. Last tweet created at {}".format(number_of_tweets, keyword, tweets[number_of_tweets-1]['created_at']))
        resp = api.request('search/tweets', {'q': 'keyword',  #INSERT YOUR CODE
                                             'count': 'batch_count',
                                             'lang': 'en',
                                             'result_type': 'recent',
                                             'max_id': 'max_id_str'
                                             }
                           )
        tweets += resp.json()['statuses']
        ids = [tweet['id'] for tweet in tweets]
        max_id_str = str(min(ids))
        number_of_tweets = len(tweets)
        print("{} tweets are collected for keyword {}. Last tweet created at {}".format(number_of_tweets, keyword, tweets[number_of_tweets-1]['created_at']))
    return tweets
After that, I ran the function as follows:
first_group = retrieve_tweets(api, 'Rock', 100, 500)
It kept running fine until around the 180th tweet, then this popped up:
179 tweets are collected for keyword Rock. Last tweet created at Mon Apr 29 02:04:05 +0000 2019
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-9-cbeb6ede7a5a> in <module>
8 # Your function call should look like this: retrieve_tweets(api,'keyword',single_count,total_count)
9
---> 10 k1_tweets = retrieve_tweets(api, 'Restaurant', 100, 500) #INSERT YOUR CODE HERE
11
12
<ipython-input-7-0d0c87e7c3e9> in retrieve_tweets(api, keyword, batch_count, total_count)
55 )
56
---> 57 tweets += resp.json()['statuses']
58 ids = [tweet['id'] for tweet in tweets]
59 max_id_str = str(min(ids))
KeyError: 'statuses'
It should have run smoothly up to 500, and I've tested the key 'statuses' multiple times before.
Additionally, this happens at random points of the collection phase; once I even managed to finish my first group of 500 tweets, but then the error popped up while collecting the second group.
Also, when this error pops up, I can't use the key 'statuses' anymore until I shut down my editor and run everything all over again.
Here's the simple test that I always run before and after the error occurs:
a = api.request('search/tweets', {'q': 'Fun', 'count':'10'})
a1 = a.json()
a1['statuses']

Use dict.get to look up the key statuses; it returns None if the key is not present, otherwise it gives the value for that key. Then only process the batch when something was actually returned:
statuses = resp.json().get('statuses')
if statuses:
    tweets += statuses
    ids = [tweet['id'] for tweet in tweets]
    max_id_str = str(min(ids))
    number_of_tweets = len(tweets)

The JSON response from Twitter will not always contain a statuses key. You need to handle a response that contains an errors key as well. Error responses are documented here: https://developer.twitter.com/en/docs/ads/general/guides/response-codes.html
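For example, here is a minimal sketch of checking for the errors key before touching statuses (the back-off value below is only an illustration, not something prescribed by the API):
import time

body = resp.json()
if 'errors' in body:
    # e.g. code 88 means "Rate limit exceeded"; report it and back off before retrying
    for error in body['errors']:
        print('{} (code {})'.format(error['message'], error['code']))
    time.sleep(60)  # illustrative back-off; adjust to the documented rate-limit window
else:
    tweets += body['statuses']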
Also, your code uses resp.json() to get this JSON structure. This is fine, but you can also use the iterator that comes with TwitterAPI. The iterator will iterate over items contained in either statuses or errors. Here is the usage:
resp = api.request('search/tweets', {'q': 'pizza'})
for item in resp.get_iterator():
    if 'text' in item:
        print(item['text'])
    elif 'message' in item:
        print('%s (%d)' % (item['message'], item['code']))
One more thing you may not be aware of: TwitterAPI comes with a utility class that will make successive requests and keep track of max_id for you. Here's a short example: https://github.com/geduldig/TwitterAPI/blob/master/examples/page_tweets.py
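As a rough sketch, that helper (TwitterPager) applied to the search endpoint used above could look like this (the stopping condition is mine, not part of the linked example):
from TwitterAPI import TwitterPager

pager = TwitterPager(api, 'search/tweets', {'q': 'Rock', 'count': 100, 'lang': 'en'})
tweets = []
for item in pager.get_iterator():
    if 'text' in item:
        tweets.append(item)
    if len(tweets) >= 500:  # stop once enough tweets are collected
        break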

Related

Can an except block of python have 2 conditions simultaneously?

I was trying to learn stock prediction with the help of this GitHub project, but when I run the main.py file given in the repository from the command line, I encounter an error:
File "/Stock-Predictor/src/tweetstream/streamclasses.py", line 101
except urllib2.HTTPError, exception:
^
SyntaxError: invalid syntax
The code below is part of a PyPI module named tweetstream, i.e. tweetstream/streamclasses.py, which produced the error when used in a Twitter sentiment analysis project:
import time
import urllib
import urllib2
import socket
from platform import python_version_tuple
import anyjson

from . import AuthenticationError, ConnectionError, USER_AGENT


class BaseStream(object):
    """A network connection to Twitters streaming API

    :param username: Twitter username for the account accessing the API.
    :param password: Twitter password for the account accessing the API.
    :keyword count: Number of tweets from the past to get before switching to
        live stream.
    :keyword url: Endpoint URL for the object. Note: you should not
        need to edit this. It's present to make testing easier.

    .. attribute:: connected
        True if the object is currently connected to the stream.

    .. attribute:: url
        The URL to which the object is connected

    .. attribute:: starttime
        The timestamp, in seconds since the epoch, the object connected to the
        streaming api.

    .. attribute:: count
        The number of tweets that have been returned by the object.

    .. attribute:: rate
        The rate at which tweets have been returned from the object as a
        float. see also :attr: `rate_period`.

    .. attribute:: rate_period
        The amount of time to sample tweets to calculate tweet rate. By
        default 10 seconds. Changes to this attribute will not be reflected
        until the next time the rate is calculated. The rate of tweets vary
        with time of day etc. so it's useful to set this to something
        sensible.

    .. attribute:: user_agent
        User agent string that will be included in the request. NOTE: This can
        not be changed after the connection has been made. This property must
        thus be set before accessing the iterator. The default is set in
        :attr: `USER_AGENT`.
    """

    def __init__(self, username, password, catchup=None, url=None):
        self._conn = None
        self._rate_ts = None
        self._rate_cnt = 0
        self._username = username
        self._password = password
        self._catchup_count = catchup
        self._iter = self.__iter__()

        self.rate_period = 10  # in seconds
        self.connected = False
        self.starttime = None
        self.count = 0
        self.rate = 0
        self.user_agent = USER_AGENT
        if url: self.url = url

    def __enter__(self):
        return self

    def __exit__(self, *params):
        self.close()
        return False

    def _init_conn(self):
        """Open the connection to the twitter server"""
        headers = {'User-Agent': self.user_agent}

        postdata = self._get_post_data() or {}
        if self._catchup_count:
            postdata["count"] = self._catchup_count

        poststring = urllib.urlencode(postdata) if postdata else None
        req = urllib2.Request(self.url, poststring, headers)

        password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
        password_mgr.add_password(None, self.url, self._username, self._password)
        handler = urllib2.HTTPBasicAuthHandler(password_mgr)
        opener = urllib2.build_opener(handler)

        try:
            self._conn = opener.open(req)
        except urllib2.HTTPError, exception:  #___________________________problem here
            if exception.code == 401:
                raise AuthenticationError("Access denied")
            elif exception.code == 404:
                raise ConnectionError("URL not found: %s" % self.url)
            else:  # re raise. No idea what would cause this, so want to know
                raise
        except urllib2.URLError, exception:
            raise ConnectionError(exception.reason)
The second item in the except is an identifier used in the body of the exception handler to access the exception information. The try/except syntax changed between Python 2 and Python 3, and your code uses the Python 2 syntax.
Python 2 (language reference):
try:
    ...
except <expression>, <identifier>:
    ...
Python 3 (language reference, rationale):
try:
    ...
except <expression> as <identifier>:
    ...
Note that <expression> can be a single exception class or a tuple of exception classes to catch more than one type in a single except clause, so to answer your title question you could use the following to handle more than one possible exception being thrown:
try:
    x = array[5]  # NameError if array doesn't exist, IndexError if it is too short
except (IndexError, NameError) as e:
    print(e)  # which was it?
Use separate except clauses:
try:
    # code here
except MyFirstError:
    # exception handling
except AnotherError:
    # exception handling
You can repeat this as many times as you need.

Exception Handling From Imported Package

I am working with twint to download some Twitter followers. Every now and then, twint will throw an error when it cannot find the "more" button. This is described here: https://github.com/twintproject/twint/issues/340.
My workaround is to loop through until I've gotten the number of followers I'd like. However, I would like to keep track of the number of times that the "more" button was not found using cons_errors. Looking at feed.py, https://github.com/twintproject/twint/blob/master/twint/feed.py, an IndexError is raised when this button is not found. How can I track when this happens?
This is the code from within feed.py that raises the error:
def Follow(response):
    logme.debug(__name__+':Follow')
    soup = BeautifulSoup(response, "html.parser")
    follow = soup.find_all("td", "info fifty screenname")
    cursor = soup.find_all("div", "w-button-more")
    try:
        cursor = findall(r'cursor=(.*?)">', str(cursor))[0]
    except IndexError:
        logme.critical(__name__+':Follow:IndexError')
    return follow, cursor
The code I am using is as follows, but it never catches the exception. Instead, twint.run.Followers(c) prints this error message to the console: CRITICAL:root:twint.feed:Follow:IndexError, and the loop continues iterating without ever printing the "in except block" message or incrementing cons_errors. In other words, the try always succeeds.
import twint
import time

c = twint.Config()
c.Limit = 1000                 # download 1,000 followers
c.Username = "MarketGoldberg"  # random account chosen with 79 followers
c.Output = "followers.txt"     # where to save followers
c.Resume = "resume.txt"        # user to pick up from if called again

download_rounds = 2  # intentionally high to force the error; all 79 followers will be downloaded in the first iteration
cons_errors = 0      # number of consecutive errors received from twint

while download_rounds > 0 and cons_errors <= 10:
    try:
        twint.run.Followers(c)
        cons_errors = 0
        download_rounds -= 1
    except IndexError as err:
        print("in except block")
        cons_errors += 1
        time.sleep(5)
Since the exception is already caught inside the package, you can't catch it a second time. You could modify the package, remove the exception handling, and handle it yourself. Or you could add a logging handler that counts log messages ending with :Follow:IndexError.
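A rough sketch of the logging-handler idea (the handler class is illustrative, and it assumes twint's message goes through the root logger, as the CRITICAL:root:... console output suggests):
import logging

class IndexErrorCounter(logging.Handler):
    """Counts log records whose message ends with ':Follow:IndexError'."""
    def __init__(self):
        super().__init__(level=logging.CRITICAL)
        self.count = 0

    def emit(self, record):
        if record.getMessage().endswith(':Follow:IndexError'):
            self.count += 1

counter = IndexErrorCounter()
logging.getLogger().addHandler(counter)  # attach to the root logger

# after twint.run.Followers(c) returns, counter.count holds the number of missed "more" buttons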

Ensuring unique timestamps generation in asyncio/aiohttp coroutines

I'm rewriting a web scraper with aiohttp. At some point, it has to make a POST request with a payload that notably includes a 'CURRENT_TIMESTAMP_ID'. These requests seem to always succeed, but they are sometimes redirected (302 status code) to another location, as additional details need to be fetched to be displayed on the page. Those redirections often fail ("A system error occurred" or a "not authorized" error message is displayed on the page), and I don't know why.
I guess it's because they sometimes share the same value for 'CURRENT_TIMESTAMP_ID' (because headers and cookies are the same). Thus, I'd like to generate a different timestamp in each request, but I had no success doing that. I tried adding some randomness with things like asyncio.sleep(1+(randint(500, 2000) / 1000)). Also, note that doing the scraping with task_limit=1 succeeds (see code below).
Here is the relevant part of my code:
async def search(Number, session):
    data = None
    loop = asyncio.get_running_loop()
    while data is None:
        t = int(round(time() * 1000))  # often got the same value here
        payload = {'Number': Number,
                   'CURRENT_TIMESTAMP_ID': t}
        params = {'CURRENT_TIMESTAMP_ID': t}
        try:
            async with session.post(SEARCH_URL, data=payload, params=params) as resp:
                resp.raise_for_status()
                data = await resp.text()
                return data
        except aiohttp.ClientError as e:
            print(f'Error with number{Number}: {e}')
It's called via:
async def main(username, password):
    headers = {'User-Agent': UserAgent().random}
    async with aiohttp.ClientSession(headers=headers) as session:
        await login(session, username, password)
        """Perform the following operations:
        1. Fetch a bunch of urls concurrently, with a limit of x tasks
        2. Gather the results into chunks of size y
        3. Process the chunks in parallel using z different processes
        """
        partial_search = async_(partial(search, session=session))  #I'm using Python 3.7
        urls = ['B5530'] * 3  #trying to scrape the same URL 3 times
        results = await (  #I'm using aiostream cause I got a huge list of urls. Problem also occurs with gather.
            stream.iterate(urls)
            | pipe.map(partial_search, ordered=False, task_limit=100)
            | pipe.chunks(100 // cpu_count())
            | pipe.map(process_in_executor, ordered=False, task_limit=cpu_count() + 1)
        )
Hope someone will see what I'm missing!

Crawl only tweet metadata, without the tweet text, using an ID list

CONTEXT: I have a list of tweet IDs and their textual content, and I need to crawl their metadata. However, my code crawls the tweet metadata and text as well. Since I have about 100K tweet IDs, I do not wish to waste time crawling the tweet text again.
Question: How can I adapt the following code so that I download only the tweet metadata? I'm using tweepy and Python 3.6.
def get_tweets_single(twapi, idfilepath):
    #tweet_id = '522778758168580098'
    tw_list = []
    with open(idfilepath,'r') as f1:  #A File that Contains tweet IDS
        lines = f1.readlines()
        for line in lines:
            try:
                print(line.rstrip('\n'))
                tweet = twapi.get_status(line.rstrip('\n'))  #tweepy function to crawl tweet metadata
                tw_list.append(tweet)
                #tweet = twapi.statuses_lookup(id_=tweet_id,include_entities=True, trim_user=True)
                with open(idjsonFile,'a',encoding='utf-8') as f2:
                    json.dump(tweet._json, f2)
            except tweepy.TweepError as te:
                print('Failed to get tweet ID %s: %s', tweet_id, te.message)

def main(args):
    print('hello')
    # connect to twitter
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
    api = tweepy.API(auth)
    get_tweets_single(api, idfilepath)
You cannot download only the metadata about a tweet.
Looking at the documentation, you can choose to exclude information about the user with trim_user=true, but that's the only thing you can strip out.
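For what it's worth, a minimal sketch of passing that flag through tweepy's get_status (assuming the same twapi object and ID-file loop as in the question):
# trim_user=True replaces the embedded user object with just the user's ID;
# the tweet text and the rest of the payload are still returned
tweet = twapi.get_status(line.rstrip('\n'), trim_user=True)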

How to retrieve all historical public tweets with Twitter Premium Search API in Sandbox version (using next token)

I want to download all historical tweets with certain hashtags and/or keywords for a research project. I got the Premium Twitter API for that. I'm using the amazing TwitterAPI to take care of auth and so on.
My problem now is that I'm not an expert developer and I have some issues understanding how the next token works, and how to get all the tweets into a CSV file.
What I want to achieve is to have all the tweets in one single CSV file, without having to manually change the fromDate and toDate values. Right now I don't know how to get the next token and how to use it to chain requests.
So far I got here:
from TwitterAPI import TwitterAPI
import csv

SEARCH_TERM = 'my-search-term-here'
PRODUCT = 'fullarchive'
LABEL = 'here-goes-my-dev-env'

api = TwitterAPI("consumer_key",
                 "consumer_secret",
                 "access_token_key",
                 "access_token_secret")

r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL),
                {'query': SEARCH_TERM,
                 'fromDate': '200603220000',
                 'toDate': '201806020000'
                 })

csvFile = open('2006-2018.csv', 'a')
csvWriter = csv.writer(csvFile)

for item in r:
    csvWriter.writerow([item['created_at'], item['user']['screen_name'], item['text'] if 'text' in item else item])
I would be really thankful for any help!
Cheers!
First of all, TwitterAPI includes a helper class that will take care of this for you. TwitterPager works with many types of Twitter endpoints, not just Premium Search. Here is an example to get you started: https://github.com/geduldig/TwitterAPI/blob/master/examples/page_tweets.py
But to answer your question, the strategy you should take is to put the request you currently have inside a while loop. Then,
1. Each request will return a next field which you can get with r.json()['next'].
2. When you are done processing the current batch of tweets and ready for your next request, you would include the next parameter set to the value above.
3. Finally, eventually a request will not include a next field in the returned JSON. At that point, break out of the while loop.
Something like the following.
next = ''
while True:
    r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL),
                    {'query': SEARCH_TERM,
                     'fromDate': '200603220000',
                     'toDate': '201806020000',
                     'next': next})
    if r.status_code != 200:
        break
    for item in r:
        csvWriter.writerow([item['created_at'], item['user']['screen_name'], item['text'] if 'text' in item else item])
    json = r.json()
    if 'next' not in json:
        break
    next = json['next']
