Youtube api V3 nextPageToken repeats - pagination

I am getting some weird results when using a URL to retrieve YouTube playlist items. First of all, the playlist I am querying contains 200 items in total.
https://www.googleapis.com/youtube/v3/playlistItems?part=snippet,status,contentDetails&maxResults=50&playlistId=PLFgquLnL59alCl_2TQvOiD5Vgm1hCaGSI&key=API_KEY
When I run this I get the expected results (50 items returned, total results 200, results per page 50, nextPageToken: "CDIQAA").
Then I keep issuing new requests, each time passing the last nextPageToken:
https://www.googleapis.com/youtube/v3/playlistItems?part=snippet,status,contentDetails&maxResults=49&playlistId=PLFgquLnL59alCl_2TQvOiD5Vgm1hCaGSI&key=API_KEY&pageToken=CDEQAA
100 results so far, nextPageToken: "CGQQAA"
https://www.googleapis.com/youtube/v3/playlistItems?part=snippet,status,contentDetails&maxResults=49&playlistId=PLFgquLnL59alCl_2TQvOiD5Vgm1hCaGSI&key=API_KEY&pageToken=CGQQAA
150 results so far, nextPageToken: "CDIQAA"
Now this nextPageToken keeps repeating: it is the same as the first nextPageToken. Why, when I still haven't retrieved all 200 results?

I guess there is some logical issue in your code; I got the token CJYBEAA after the third request.
Here is a function that works fine with your playlist ID and returns all 200 video IDs:
def getPlaylistVideosIDs(playlist_id):
    videos_IDs = []
    # first page
    search = YOUR_YOUTUBE_KEY.playlistItems().list(part='snippet', playlistId=playlist_id,
                                                   maxResults=50).execute()
    try:
        nextPageToken = search['nextPageToken']
    except KeyError:
        nextPageToken = None
    for item in search['items']:
        videos_IDs.append(item['snippet']['resourceId']['videoId'])
    # keep requesting pages until no nextPageToken is returned
    while nextPageToken:
        search = YOUR_YOUTUBE_KEY.playlistItems().list(pageToken=nextPageToken, part='snippet',
                                                       playlistId=playlist_id,
                                                       maxResults=50).execute()
        for item in search['items']:
            videos_IDs.append(item['snippet']['resourceId']['videoId'])
        try:
            nextPageToken = search['nextPageToken']
        except KeyError:
            break
    return videos_IDs
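For completeness, a minimal sketch of how the client object called YOUR_YOUTUBE_KEY above can be built with google-api-python-client and how the function might be called; the key value is a placeholder:

from googleapiclient.discovery import build

API_KEY = 'YOUR_API_KEY'  # placeholder, not a real key
YOUR_YOUTUBE_KEY = build('youtube', 'v3', developerKey=API_KEY)

video_ids = getPlaylistVideosIDs('PLFgquLnL59alCl_2TQvOiD5Vgm1hCaGSI')
print(len(video_ids))  # should print 200 for this playlist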

Related

Can not get client.command parameter to parse API response by key value in discord.py

I'm building a command onto an existing bot that will search an API and take a baseball player's name as a parameter to query a json response with. I've gotten everything to work correctly in test, only for the life of me I can't figure out how to restrict the results to only those that include the query parameter that is passed when the command is invoked within discord.
For example: a user will type !card Adam Dunn and only the value "Adam Dunn" for the key "name" will return. Currently, the entire first page of results is being sent no matter what is typed for the parameter, and with my embed logic running, each result gets a separate embed, which isn't ideal.
I've only included the pertinent lines of code and not included the massive embed of the results for readability's sake.
It's got to be something glaringly simple, but I think I've just been staring at it for too long to see it. Any help would be greatly appreciated, thank you!
Here is the code I'm currently working with:
async def card(ctx, *, player_name: str):
    async with ctx.channel.typing():
        async with aiohttp.ClientSession() as cs:
            async with cs.get("https://website.items.json") as r:
                data = await r.json()
                listings = data["items"]
                for k in listings:
                    if player_name == k["name"]:
                        print()
I hope I understood you right. If the user did not give a player_name, you will just keep searching for nothing, and you want to stop early when no player_name is given. If that is the case:
Set the default value of player_name to None (player_name: str = None), then check at the beginning of your code whether it was provided.
async def card(ctx, *, player_name: str = None):
    if not player_name:
        return await ctx.send('You must enter a player name')
    # if there is a name, do this
    async with ctx.channel.typing():
        async with aiohttp.ClientSession() as cs:
            async with cs.get("https://theshownation.com/mlb20/apis/items.json") as r:
                data = await r.json()
                listings = data["items"]
                for k in listings:
                    if player_name == k["name"]:
                        print()
Update:
I'm an idiot. Works as expected, but because the player_name I was searching for wasn't on the first page of results, it wasn't showing. When using a player_name that is on the first page of the API results, it works just fine.
This is a pagination issue, not a key value issue.
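For what it's worth, a minimal sketch of how the loop could walk every page instead of only the first one. The page query parameter and the total_pages field are assumptions about this particular API, not something taken from its documentation:

# hedged sketch: assumes the API accepts a ?page= parameter and reports
# the page count in a "total_pages" field; both are guesses
async def card(ctx, *, player_name: str = None):
    if not player_name:
        return await ctx.send('You must enter a player name')
    matches = []
    async with ctx.channel.typing():
        async with aiohttp.ClientSession() as cs:
            page = 1
            while True:
                url = "https://theshownation.com/mlb20/apis/items.json"
                async with cs.get(url, params={"page": page}) as r:
                    data = await r.json()
                matches += [k for k in data["items"] if k["name"] == player_name]
                if page >= data.get("total_pages", 1):  # assumed field
                    break
                page += 1
    # build the embed(s) from `matches` here rather than from the raw first page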

Ensuring unique timestamps generation in asyncio/aiohttp coroutines

I'm rewriting a web scraper with aiohttp. At some point, it has to make a POST request with a payload that notably includes a 'CURRENT_TIMESTAMP_ID'. These requests seem to always succeed, but they are sometimes redirected (302 status code) to another location, as additional details need to be fetched to be displayed on the page. Those redirections often fail ("A system error occurred" or "not authorized" error message is displayed on the page), and I don't know why.
I guess it's because the requests sometimes share the same value for 'CURRENT_TIMESTAMP_ID' (headers and cookies being otherwise identical). Thus, I'd like to generate a different timestamp for each request, but I have had no success doing that. I tried adding some randomness with things like asyncio.sleep(1 + (randint(500, 2000) / 1000)). Also, note that doing the scraping with task_limit=1 succeeds (see code below).
Here is the relevant part of my code:
async def search(Number, session):
    data = None
    loop = asyncio.get_running_loop()
    while data is None:
        t = int(round(time() * 1000))  # often got the same value here
        payload = {'Number': Number,
                   'CURRENT_TIMESTAMP_ID': t}
        params = {'CURRENT_TIMESTAMP_ID': t}
        try:
            async with session.post(SEARCH_URL, data=payload, params=params) as resp:
                resp.raise_for_status()
                data = await resp.text()
                return data
        except aiohttp.ClientError as e:
            print(f'Error with number{Number}: {e}')
It's called via:
async def main(username, password):
    headers = {'User-Agent': UserAgent().random}
    async with aiohttp.ClientSession(headers=headers) as session:
        await login(session, username, password)
        """Perform the following operations:
        1. Fetch a bunch of urls concurrently, with a limit of x tasks
        2. Gather the results into chunks of size y
        3. Process the chunks in parallel using z different processes
        """
        partial_search = async_(partial(search, session=session))  # I'm using Python 3.7
        urls = ['B5530'] * 3  # trying to scrape the same URL 3 times
        results = await (  # I'm using aiostream because I have a huge list of urls. Problem also occurs with gather.
            stream.iterate(urls)
            | pipe.map(partial_search, ordered=False, task_limit=100)
            | pipe.chunks(100 // cpu_count())
            | pipe.map(process_in_executor, ordered=False, task_limit=cpu_count() + 1)
        )
Hope someone will see what I'm missing!
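Since the goal is to guarantee distinct 'CURRENT_TIMESTAMP_ID' values across concurrent coroutines, here is a minimal sketch of one way to do it: keep a shared counter so that two calls in the same millisecond still get different, strictly increasing values. This only illustrates the idea and is not a confirmed fix for the site in question:

from time import time

# hedged sketch: a module-level generator of strictly increasing millisecond IDs;
# the coroutines all run in one thread and this function never awaits, so no lock is needed
_last_ts = 0

def unique_timestamp_ms():
    global _last_ts
    t = int(round(time() * 1000))
    if t <= _last_ts:        # same millisecond as the previous call
        t = _last_ts + 1     # bump so the value stays unique
    _last_ts = t
    return t

# inside search(), replace the line that computes t with:
#     t = unique_timestamp_ms()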

How to fix KeyError 'statuses' while collecting tweets?

I was collecting users' tweets using TwitterAPI when I stumbled upon this error.
Since I'm planning to crawl at least 500 tweets with different attributes, and each query returns at most 100 tweets, I made a function.
!pip install TwitterAPI
from TwitterAPI import TwitterAPI
import json
CONSUMER_KEY = #ENTER YOUR CONSUMER_KEY
CONSUMER_SECRET = #ENTER YOUR CONSUMER_SECRET
OAUTH_TOKEN = #ENTER YOUR OAUTH_TOKEN
OAUTH_TOKEN_SECRET = #ENTER YOUR OAUTH_TOKEN_SECRET
api = TwitterAPI(CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
Here's how my function goes:
def retrieve_tweets(api, keyword, batch_count, total_count):
    tweets = []
    batch_count = str(batch_count)
    resp = api.request('search/tweets', {'q': keyword,
                                         'count': batch_count,
                                         'lang': 'en',
                                         'result_type': 'recent',
                                         }
                       )
    # store the tweets in the list
    tweets += resp.json()['statuses']
    # find the max_id_str for the next batch
    ids = [tweet['id'] for tweet in tweets]
    max_id_str = str(min(ids))
    # loop until as many tweets as total_count are collected
    number_of_tweets = len(tweets)
    while number_of_tweets < total_count:
        print("{} tweets are collected for keyword {}. Last tweet created at {}".format(number_of_tweets, keyword, tweets[number_of_tweets-1]['created_at']))
        resp = api.request('search/tweets', {'q': keyword,
                                             'count': batch_count,
                                             'lang': 'en',
                                             'result_type': 'recent',
                                             'max_id': max_id_str
                                             }
                           )
        tweets += resp.json()['statuses']
        ids = [tweet['id'] for tweet in tweets]
        max_id_str = str(min(ids))
        number_of_tweets = len(tweets)
    print("{} tweets are collected for keyword {}. Last tweet created at {}".format(number_of_tweets, keyword, tweets[number_of_tweets-1]['created_at']))
    return tweets
After that, I ran the function as follows:
first_group = retrieve_tweets(api, 'Rock', 100, 500)
It kept running fine until around the 180th tweet, then this popped up:
179 tweets are collected for keyword Rock. Last tweet created at Mon Apr 29 02:04:05 +0000 2019
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-9-cbeb6ede7a5a> in <module>
8 # Your function call should look like this: retrieve_tweets(api,'keyword',single_count,total_count)
9
---> 10 k1_tweets = retrieve_tweets(api, 'Restaurant', 100, 500) #INSERT YOUR CODE HERE
11
12
<ipython-input-7-0d0c87e7c3e9> in retrieve_tweets(api, keyword, batch_count, total_count)
55 )
56
---> 57 tweets += resp.json()['statuses']
58 ids = [tweet['id'] for tweet in tweets]
59 max_id_str = str(min(ids))
KeyError: 'statuses'
It should have run smoothly up to 500, and I've tested the key 'statuses' multiple times before.
Additionally, this happens at random points of the collecting phase; once I managed to finish my first group of 500 tweets, but the error then popped up while collecting the second group.
Also, when this error pops up, I can't use the key 'statuses' anymore until I shut down my editor and run everything all over again.
Here's the simple test that I always run before and after the error occurs.
a = api.request('search/tweets', {'q': 'Fun', 'count':'10'})
a1 = a.json()
a1['statuses']
Use dict.get to read the key 'statuses': it returns None if the key is not present, otherwise it gives the value for 'statuses'. Then guard the rest of the loop body on that result:
statuses = resp.json().get('statuses')
if statuses:
    tweets += statuses
    ids = [tweet['id'] for tweet in tweets]
    max_id_str = str(min(ids))
    number_of_tweets = len(tweets)
The JSON response from Twitter will not always contain a statuses key. You need to handle a response that contains an errors key as well. Error responses are documented here: https://developer.twitter.com/en/docs/ads/general/guides/response-codes.html
Also, your code uses resp.json() to get this JSON structure. This is fine, but you can also use the iterator that comes with TwitterAPI. The iterator will iterate over the items contained in either statuses or errors. Here is the usage:
resp = api.request('search/tweets', {'q': 'pizza'})
for item in resp.get_iterator():
    if 'text' in item:
        print(item['text'])
    elif 'message' in item:
        print('%s (%d)' % (item['message'], item['code']))
One more thing you may not be aware of: TwitterAPI comes with a utility class that will make successive requests and keep track of max_id for you. Here's a short example: https://github.com/geduldig/TwitterAPI/blob/master/examples/page_tweets.py
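To illustrate that utility class, a minimal sketch using TwitterPager, the class used in the linked example; the keyword and the 500-tweet cap are placeholders matching the question:

from TwitterAPI import TwitterPager

# TwitterPager issues successive search/tweets requests and handles max_id itself
pager = TwitterPager(api, 'search/tweets',
                     {'q': 'Rock', 'count': 100, 'lang': 'en', 'result_type': 'recent'})
collected = []
for item in pager.get_iterator():
    if 'text' in item:
        collected.append(item)
        if len(collected) >= 500:
            break
    elif 'message' in item:
        # rate-limit or other API error; stop here (or sleep and retry)
        print('%s (%d)' % (item['message'], item['code']))
        break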

crawl only tweets metadata without the tweet text using an ID list

CONTEXT: I have a list of tweet IDs and their textual content, and I need to crawl their metadata. However, my code crawls the tweet metadata and the text as well. Since I have about 100K tweet IDs, I do not wish to waste time crawling the tweet text again.
Question: How can I adapt the following code so that it downloads only the tweet metadata? I'm using tweepy and Python 3.6.
def get_tweets_single(twapi, idfilepath):
    # tweet_id = '522778758168580098'
    tw_list = []
    with open(idfilepath, 'r') as f1:  # a file that contains tweet IDs
        lines = f1.readlines()
        for line in lines:
            tweet_id = line.rstrip('\n')
            try:
                print(tweet_id)
                tweet = twapi.get_status(tweet_id)  # tweepy call that fetches the full tweet
                tw_list.append(tweet)
                # tweet = twapi.statuses_lookup(id_=tweet_id, include_entities=True, trim_user=True)
                with open(idjsonFile, 'a', encoding='utf-8') as f2:
                    json.dump(tweet._json, f2)
            except tweepy.TweepError as te:
                print('Failed to get tweet ID %s: %s' % (tweet_id, te))
def main(args):
    print('hello')
    # connect to twitter
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
    api = tweepy.API(auth)
    get_tweets_single(api, idfilepath)
You cannot download only the metadata of a tweet.
Looking at the documentation, you can choose to exclude information about the author with trim_user=True, but that's the only thing you can strip out.
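As a minimal sketch of that option in the question's code (assuming tweepy 3.x, where get_status accepts a trim_user parameter), the relevant call would look roughly like this:

# hedged sketch: trim_user=True drops the embedded author object,
# but the returned status still contains the tweet text
tweet = twapi.get_status(tweet_id, trim_user=True)
json.dump(tweet._json, f2)

The commented-out statuses_lookup call in the question accepts the same flag and takes up to 100 IDs per request, which would reduce the number of API calls needed for 100K tweets.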

Filter out own retweets Tweepy

My application (Python 3 + Tweepy) finds tweets matching a hashtag and retweets them.
I get a "Retweet is not permissible for this status" error because it tries to retweet its own tweets.
How do I filter those out?
# retweet function
def hashtag_Retweet():
    print(tweet.id)
    api.retweet(tweet.id)  # retweet
    print(tweet.text)
    return

query = '#foosball'
our_own_id = '3678887154'  # made up for this post

tweets = api.search(query)
for tweet in tweets:
    # make sure that tweet does not come from host
    hashtag_Retweet()
Something like this would work.
for tweet in tweets:
    if str(tweet.user.id) != our_own_id:  # our_own_id is a string, so compare as strings
        hashtag_Retweet()
Hope it helps.
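A slightly more robust sketch of the same idea: fetch your own ID from the authenticated credentials instead of hardcoding it, and guard the retweet call, since Tweepy raises TweepError for statuses that cannot be retweeted. This assumes Tweepy 3.x, where api.me() returns the authenticated user:

# hedged sketch, assuming tweepy 3.x
our_own_id = api.me().id  # numeric ID of the authenticated account

for tweet in api.search(query):
    if tweet.user.id == our_own_id:
        continue  # skip our own tweets
    try:
        api.retweet(tweet.id)
    except tweepy.TweepError as e:
        print('Could not retweet %s: %s' % (tweet.id, e))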
