I want to access the past three months of Twitter data to train a stock prediction model. I plan to run a loop that fetches data for successive dates, and I am using Tweepy in Python, but I don't know how to get data for a specific date. I am also a bit confused about the data I am currently getting: is it from the same day or from some previous days? This is my code. How should I modify it to get Twitter data for specific dates? I am new to coding with Python, so it would help a lot if you could show me the code.
import sys
import tweepy
import matplotlib.pyplot as plt
def percentage(part, whole):
    return 100 * float(part) / float(whole)
consumerKey = "aaaaaaaaaaaaaaaaaaaaaaaaa"
consumerSecret = "bbbbbbbbbbbbbbbbbbbbbbbb"
accessToken = "ccccccccccccccc"
accessTokenSecret = "ddddddddddddddddddddddddddddd"
auth = tweepy.OAuthHandler(consumerKey,consumerSecret)
auth.set_access_token(accessToken, accessTokenSecret)
api = tweepy.API(auth)
searchTerm = input("Enter keyword/hashtag to search about: ")
noOfSearchTerms = int(input("Enter how many tweets to analyze: "))
tweets = tweepy.Cursor(api.search, q=searchTerm).items(noOfSearchTerms)
Hey, I'm trying to get the tweets that have been posted from 1 October 2020 until today, that contain the 'covid' keyword, and that are located in the UK, using Tweepy, and to export them as a CSV file using the pandas library, but the results I've got only go back to 29 October 2020.
This is the filtering part of the code:
import sys
import tweepy as tw
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler , Stream
import json
access_token = "xxxxxx"
access_token_secret = "xxxx"
consumer_key = "xxxxx"
consumer_secret = "xxxxx"
auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)
search_words ="covid"
date_since = "2020-10-01"
results = []
for tweet in tw.Cursor(api.search, tweet_mode='extended', q=search_words, lang="en",
                       since=date_since, geocode='51.745719,-1.236599,300km').items(9000):
    results.append(tweet.created_at)
print(results)
By default, Tweepy's search method uses Twitter's legacy standard search API, which can only fetch roughly the last 7 days of Tweets. You will need the 30-day search option (the API.search_30_day method), which also requires you to configure a premium search environment in the Twitter developer portal. Note that the search operators and syntax for premium search differ from those in standard search.
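For example, here is a minimal sketch of a 30-day search. This assumes Tweepy 4.x, where the premium endpoint is exposed as API.search_30_day(label, query, ...); 'dev' is a placeholder for whatever environment label you configure in the developer portal, and the dates use the YYYYMMDDhhmm format:
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)

# Premium 30-day search: can go back up to 30 days, unlike standard search's ~7 days
for tweet in tweepy.Cursor(api.search_30_day,
                           label='dev',              # placeholder environment label
                           query='covid',
                           fromDate='202010010000',  # YYYYMMDDhhmm
                           toDate='202010280000').items(100):
    print(tweet.created_at, tweet.text)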
import re
import tweepy
import numpy as np
import pandas as pd
from textblob import TextBlob
ACCESS_TOKEN="XXXX"
ACCESS_SECRET="XXXX"
CONSUMER_KEY="XXXX"
CONSUMER_SECRET="XXXX"
def twitter_setup():
    """
    Utility function to set up the Twitter API
    with our provided access keys.
    """
    # Authentication and access using keys:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    # Return API with authentication:
    api = tweepy.API(auth)
    return api
extractor = twitter_setup()
tweets = extractor.user_timeline(screen_name="realDonaldTrump", count=200)
data = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])
data['len'] = np.array([len(tweet.text) for tweet in tweets])
data['ID'] = np.array([tweet.id for tweet in tweets])
data['Date'] = np.array([tweet.created_at for tweet in tweets])
data['Source'] = np.array([tweet.source for tweet in tweets])
data['Likes'] = np.array([tweet.favorite_count for tweet in tweets])
data['RTs'] = np.array([tweet.retweet_count for tweet in tweets])
def clean_tweet(tweet):
    '''
    Utility function to clean the text in a tweet by removing
    links and special characters using regex.
    '''
    return ' '.join(re.sub(r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+://\S+)", " ", tweet).split())
def analize_sentiment(tweet):
    '''
    Utility function to classify the polarity of a tweet
    using textblob.
    '''
    analysis = TextBlob(clean_tweet(tweet))
    if analysis.sentiment.polarity > 0:
        return 1
    elif analysis.sentiment.polarity == 0:
        return 0
    else:
        return -1
data['SA'] = np.array([analize_sentiment(tweet) for tweet in data['Tweets']])
display(data.head(200))
I am working on a project in which we extract tweets from some world leaders and then try to compare their relationships with other countries based on their tweets. So far we have extracted the tweets from Donald Trump's account and categorized them into positive and negative. The problem I am facing is how to separate the tweets country-wise. Is there any way to extract only those tweets in which he/she has tweeted about some country and ignore the rest, so that we only get the tweets related to that country?
I don't have enough reputation to add a comment, but you need to know that you have posted all your access tokens, and that is a bad idea.
You might load a list of countries, such as the one in the github repo by marijn; it also contains a list of nationalities.
Then check, per tweet, whether a name from the list occurs (so you would iterate over the list). You might keep a counter for each country occurring per tweet and add this counter data as a column to your dataframe (similar to your earlier approach for analyzing sentiment), as in the sketch below.
This is just an idea; I'm not able to comment yet because I'm new.
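A rough sketch of that idea, reusing the data frame and np from your code (the short countries list and the column name are placeholders; in practice you would load the full list from the repo):
# Placeholder list; load the full list of country names from a file in practice
countries = ['China', 'Mexico', 'Russia', 'Canada', 'Germany']

def count_country_mentions(text):
    # Count how many country names from the list occur in the tweet text
    return sum(1 for country in countries if country.lower() in text.lower())

# Add the counter data as a column, similar to the sentiment column:
data['CountryMentions'] = np.array([count_country_mentions(tweet) for tweet in data['Tweets']])

# Keep only the tweets that mention at least one country:
country_tweets = data[data['CountryMentions'] > 0]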
I want to download all historical tweets with certain hashtags and/or keywords for a research project. I got the Premium Twitter API for that. I'm using the amazing TwitterAPI to take care of auth and so on.
My problem now is that I'm not an expert developer and I have some issues understanding how the next token works, and how to get all the tweets in a csv.
What I want to achieve is to have all the tweets in one single csv, without having to manually change the dates of the fromDate and toDate values. Right now I don't know how to get the next token and how to use it to concatenate requests.
So far I got here:
from TwitterAPI import TwitterAPI
import csv
SEARCH_TERM = 'my-search-term-here'
PRODUCT = 'fullarchive'
LABEL = 'here-goes-my-dev-env'
api = TwitterAPI("consumer_key",
                 "consumer_secret",
                 "access_token_key",
                 "access_token_secret")
r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL),
                {'query': SEARCH_TERM,
                 'fromDate': '200603220000',
                 'toDate': '201806020000'})
csvFile = open('2006-2018.csv', 'a')
csvWriter = csv.writer(csvFile)
for item in r:
    csvWriter.writerow([item['created_at'], item['user']['screen_name'],
                        item['text'] if 'text' in item else item])
I would be really thankful for any help!
Cheers!
First of all, TwitterAPI includes a helper class that will take care of this for you. TwitterPager works with many types of Twitter endpoints, not just Premium Search. Here is an example to get you started: https://github.com/geduldig/TwitterAPI/blob/master/examples/page_tweets.py
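For illustration, a minimal sketch of the TwitterPager approach, reusing the api, PRODUCT, LABEL, SEARCH_TERM, and csvWriter objects from your code (the wait value is just a polite delay between requests):
from TwitterAPI import TwitterPager

pager = TwitterPager(api,
                     'tweets/search/%s/:%s' % (PRODUCT, LABEL),
                     {'query': SEARCH_TERM,
                      'fromDate': '200603220000',
                      'toDate': '201806020000'})
# get_iterator() follows the 'next' token between requests for you
for item in pager.get_iterator(wait=2):
    csvWriter.writerow([item['created_at'], item['user']['screen_name'],
                        item['text'] if 'text' in item else item])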
But to answer your question, the strategy you should take is to put the request you currently have inside a while loop. Then,
1. Each request will return a next field, which you can get with r.json()['next'].
2. When you are done processing the current batch of tweets and are ready for your next request, include the next parameter set to the value above.
3. Eventually a request will not include a next in the returned JSON. At that point, break out of the while loop.
Something like the following.
params = {'query': SEARCH_TERM,
          'fromDate': '200603220000',
          'toDate': '201806020000'}
while True:
    r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL), params)
    if r.status_code != 200:
        break
    for item in r:
        csvWriter.writerow([item['created_at'], item['user']['screen_name'],
                            item['text'] if 'text' in item else item])
    body = r.json()
    if 'next' not in body:
        break
    params['next'] = body['next']  # fetch the next page on the following iteration
This code opens a Twitter stream listener, and the search terms are in the variable upgrades_str. Some searches work, and some don't. I added AMZN to the upgrades list just to be sure there's a frequently used term, since this uses an open Twitter stream rather than searching existing tweets.
Below, I think we only need to review numbers 2 and 4.
I'm using Python 3.5.2 :: Anaconda 4.0.0 (64-bit) on Windows 10.
Variable searches
Searching with the variable upgrades_str = ['AMZN', 'SWK', 'AIQUY', 'SFUN', 'DOOR'] returns tweets such as 'i'm tired of people'.
Searching with the variable upgrades_str = ['$AMZN', '$SWK', '$AIQUY', '$SFUN', '$DOOR'] returns tweets such as 'Chicago to south Florida. Hiphop lives'. This search is the one I wish worked.
Explicit searches
Searching by replacing the variable upgrades_str with the explicit list ['AMZN', 'SWK', 'AIQUY', 'SFUN', 'DOOR'] returns 'After being walked in on twice, I have finally figured out how to lock the door here in Sweden'. This one at least contains the search term 'door'.
Searching by replacing the variable upgrades_str with the explicit list ['$AMZN', '$SWK', '$AIQUY', '$SFUN', '$DOOR'] returns '$AMZN $WFM $KR $REG $KIM: Amazon’s Whole Foods buy puts shopping centers at risk as real'. So the explicit call works, but not the identical variable.
Explicitly searching for ['$AMZN'] returns a good tweet: 'FANG setting up really good for next week! Added $googl jun23 970c avg at 4.36. $FB $AMZN'.
Explicitly searching for ['cool'] returns 'I can’t believe I got such a cool Pillow!'
import tweepy
import dataset
from textblob import TextBlob
from sqlalchemy.exc import ProgrammingError
import json
db = dataset.connect('sqlite:///tweets.db')
class StreamListener(tweepy.StreamListener):
    def on_status(self, status):
        if status.retweeted:
            return
        description = status.user.description
        loc = status.user.location
        text = status.text
        coords = status.coordinates
        geo = status.geo
        name = status.user.screen_name
        user_created = status.user.created_at
        followers = status.user.followers_count
        id_str = status.id_str
        created = status.created_at
        retweets = status.retweet_count
        bg_color = status.user.profile_background_color
        blob = TextBlob(text)
        sent = blob.sentiment
        if geo is not None:
            geo = json.dumps(geo)
        if coords is not None:
            coords = json.dumps(coords)
        table = db['tweets']
        try:
            table.insert(dict(
                user_description=description,
                user_location=loc,
                coordinates=coords,
                text=text,
                geo=geo,
                user_name=name,
                user_created=user_created,
                user_followers=followers,
                id_str=id_str,
                created=created,
                retweet_count=retweets,
                user_bg_color=bg_color,
                polarity=sent.polarity,
                subjectivity=sent.subjectivity,
            ))
        except ProgrammingError as err:
            print(err)

    def on_error(self, status_code):
        if status_code == 420:
            return False
access_token = 'token'
access_token_secret = 'tokensecret'
consumer_key = 'consumerkey'
consumer_secret = 'consumersecret'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)
stream_listener = StreamListener()
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
stream.filter(track=upgrades_str, languages=['en'])
Here's the answer, in case someone has the problem in the future: "Note that punctuation is not considered to be part of a #hashtag or #mention, so a track term containing punctuation will not match either #hashtags or #mentions." From: https://dev.twitter.com/streaming/overview/request-parameters#track
And for multiple terms, the string (which was converted from a list) needs to be changed to ['term1,term2']. Just strip out the apostrophes and spaces:
upgrades_str = re.sub(r"[' \[\]]", '', upgrades_str)
upgrades_str = "['" + upgrades_str + "']"
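Equivalently, a sketch of the same idea assuming the terms start out as a plain Python list: pass punctuation-free terms directly to stream.filter, since the streaming API treats each list element as a separate track term and ignores punctuation such as '$' anyway:
# '$' prefixes are dropped because track matching ignores punctuation
upgrades = ['AMZN', 'SWK', 'AIQUY', 'SFUN', 'DOOR']
stream.filter(track=upgrades, languages=['en'])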
Noob python user:
I've created a file that extracts 10 tweets based on api.search (not the streaming API). I get results on screen, but I cannot figure out how to parse the output to save it to CSV. My error is TypeError: expected a character buffer object.
I have tried using .join(str(x)) and get other errors.
My code is
import tweepy
import time
from tweepy import OAuthHandler
from tweepy import Cursor
#Consumer keys and access tokens, used for Twitter OAuth
consumer_key = ''
consumer_secret = ''
atoken = ''
asecret = ''
# The OAuth process that uses keys and tokens
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(atoken, asecret)
# Creates instance to execute requests to Twitter API
api = tweepy.API(auth)
MarSec = tweepy.Cursor(api.search, q='maritime security').items(10)
for tweet in MarSec:
    print " "
    print tweet.created_at, tweet.text, tweet.lang
    saveFile = open('MarSec.csv', 'a')
    saveFile.write(tweet)
    saveFile.write('\n')
    saveFile.close()
Any help would be appreciated. I've gotten my Streaming API to work, but am having difficulty with this one.
Thanks.
tweet is not a string or a character buffer. It's an object. Replace your line with saveFile.write(tweet.text) and you'll be good to go.
saveFile = open('MarSec.csv', 'a')
for tweet in MarSec:
    print " "
    print tweet.created_at, tweet.text, tweet.lang
    saveFile.write("%s %s %s\n" % (tweet.created_at, tweet.lang, tweet.text))
saveFile.close()
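Since the target file is a CSV, a variant using the csv module may be safer (a sketch for the Python 2 interpreter this question's code uses; quoting the fields keeps commas and newlines inside tweet text from breaking rows):
import csv

with open('MarSec.csv', 'a') as saveFile:
    csvWriter = csv.writer(saveFile)
    for tweet in MarSec:
        # encode to UTF-8 because the csv module on Python 2 expects byte strings
        csvWriter.writerow([tweet.created_at, tweet.lang, tweet.text.encode('utf-8')])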
I just thought I'd put up another version for those who might want to save all the attributes of a tweepy.models.Status object, in case you're not yet sure which attributes of each tweet you want to save to file.
import json
search_results = []
for status in tweepy.Cursor(api.search, q=search_text).items(5000):
    search_results.append(status._json)

with open('search_results.json', 'w') as f:
    json.dump(search_results, f)
The first block stores the search results in a list of dictionaries, and the second block writes all the tweets to a JSON file.
Please beware: this might use a lot of memory if your search results are very large.
This is Twitter's classic error code for when something is wrong with an image you are uploading.
Find the images you are trying to upload and check their format.
The only thing I did was delete the images that my Windows media player couldn't read, and that's all: the script ran perfectly.
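One way to automate that check (a sketch that assumes the Pillow library; any image library that can parse the file would do):
from PIL import Image

def is_valid_image(path):
    # Returns True if Pillow can parse the file as an image
    try:
        with Image.open(path) as img:
            img.verify()  # raises an exception on a corrupt or unreadable file
        return True
    except Exception:
        return False

# Upload only files that pass the check, e.g.:
# valid = [p for p in image_paths if is_valid_image(p)]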