How to use Boto3 pagination - python-3.x

BACKGROUND:
The AWS operation to list IAM users returns a max of 50 by default.
Reading the docs (links below), I ran the following code and got a complete set of data by setting "MaxItems" to 1000.
paginator = client.get_paginator('list_users')
response_iterator = paginator.paginate(
    PaginationConfig={
        'MaxItems': 1000,
        'PageSize': 123})

for page in response_iterator:
    u = page['Users']
    for user in u:
        print(user['UserName'])
http://boto3.readthedocs.io/en/latest/guide/paginators.html
https://boto3.readthedocs.io/en/latest/reference/services/iam.html#IAM.Paginator.ListUsers
QUESTION:
If the "MaxItems" was set to 10, for example, what would be the best method to loop through the results?
I tested with the following but it only loops 2 iterations before 'IsTruncated' == False and results in "KeyError: 'Marker'". Not sure why this is happening because I know there are over 200 results.
marker = None
while True:
    paginator = client.get_paginator('list_users')
    response_iterator = paginator.paginate(
        PaginationConfig={
            'MaxItems': 10,
            'StartingToken': marker})
    #print(response_iterator)
    for page in response_iterator:
        u = page['Users']
        for user in u:
            print(user['UserName'])
        print(page['IsTruncated'])
        marker = page['Marker']
        print(marker)
    else:
        break

(Answer rewrite)
**NOTE**: the paginator contains a bug that doesn't tally with the documentation (or vice versa). MaxItems doesn't return the Marker or NextToken when the total number of items exceeds MaxItems. In fact, it is PageSize that controls whether the Marker/NextToken indicator is returned.
import sys
import boto3

iam = boto3.client("iam")
marker = None
while True:
    paginator = iam.get_paginator('list_users')
    response_iterator = paginator.paginate(
        PaginationConfig={
            'PageSize': 10,
            'StartingToken': marker})
    for page in response_iterator:
        print("Next Page : {} ".format(page['IsTruncated']))
        u = page['Users']
        for user in u:
            print(user['UserName'])
    try:
        marker = response_iterator['Marker']
        print(marker)
    except KeyError:
        sys.exit()
It is not your mistake that your code doesn't work. MaxItems in the paginator seems to act as a "threshold" indicator. Ironically, MaxItems in the original boto3.iam.list_users still works as documented.
If you check boto3.iam.list_users, you will notice that you must either omit Marker entirely or pass it a value. Apparently, the paginator is NOT a simple wrapper for every boto3 list_* method.
import sys
import boto3

iam = boto3.client("iam")
marker = None
while True:
    if marker:
        response_iterator = iam.list_users(
            MaxItems=10,
            Marker=marker
        )
    else:
        response_iterator = iam.list_users(
            MaxItems=10
        )
    print("Next Page : {} ".format(response_iterator['IsTruncated']))
    for user in response_iterator['Users']:
        print(user['UserName'])
    try:
        marker = response_iterator['Marker']
        print(marker)
    except KeyError:
        sys.exit()
You can follow up on the issue I filed on the boto3 GitHub. According to a project member, you can call build_full_result() after paginate(), which gives the desired behavior.
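For reference, a minimal sketch of that suggestion (assuming the same iam client as above); build_full_result() walks every page and merges the result keys into one dict, so there is no need to track Marker/StartingToken by hand:

import boto3

iam = boto3.client("iam")
paginator = iam.get_paginator('list_users')

# Aggregate all pages into a single result dict.
users = paginator.paginate().build_full_result()['Users']
for user in users:
    print(user['UserName'])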

This code wasn't working for me. It always drops the remainder of the items on the last page and doesn't include them in the results: it gives me 60 accounts when I know I have 68, because that last result page never gets appended to my list of UserNames. I suspect the examples above do the same thing and people aren't noticing it in their results.
Beyond that, it seems overly complex to paginate through with an arbitrary page size; to what end?
This should be simple, and it gives you a complete listing.
import boto3

iam = boto3.client("iam")
paginator = iam.get_paginator('list_users')
response_iterator = paginator.paginate()
accounts = []
for page in response_iterator:
    for user in page['Users']:
        accounts.append(user['UserName'])

len(accounts)
68

This post is pretty old, but due to the lack of concise documentation I want to share my code for everyone who is struggling with this.
Here are two simple examples of how I solved it using Boto3's paginator; hopefully this helps you understand how it works.
Boto3 official pagination documentation:
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/paginators.html
AWS API specifying that the first token should be $null (None in Python):
https://docs.aws.amazon.com/powershell/latest/reference/items/Get-SSMParametersByPath.html
Examples:
A first example, kept simple for people like me who struggled to understand how this works:
import boto3

# Assumed setup (the original snippet references an existing `paginator`):
ssm = boto3.client('ssm')
paginator = ssm.get_paginator('get_parameters_by_path')
baseSSMPath = 'path_to_the_parameters'

def read_ssm_parameters():
    myNextToken = True  # start the loop; replaced with the real token (or False) below
    page_iterator = paginator.paginate(
        Path=baseSSMPath,
        Recursive=True,
        PaginationConfig={
            'MaxItems': 10,
            'PageSize': 10,
        }
    )
    while myNextToken:
        for page in page_iterator:
            print('# This is a new page')
            print(page['Parameters'])
            if 'NextToken' in page.keys():
                print(page['NextToken'])
                myNextToken = page['NextToken']
            else:
                myNextToken = False
        page_iterator = paginator.paginate(
            Path=baseSSMPath,
            Recursive=True,
            PaginationConfig={
                'MaxItems': 10,
                'PageSize': 10,
                'StartingToken': myNextToken
            }
        )
A second example, with less code, which avoids the duplicated paginate call of the first example:
def read_ssm_parameters(myNextToken='None'):
    while myNextToken:
        page_iterator = paginator.paginate(
            Path=baseSSMPath,
            Recursive=True,
            PaginationConfig={
                'MaxItems': 10,
                'PageSize': 10,
                'StartingToken': myNextToken
            }
        )
        for page in page_iterator:
            if 'NextToken' in page.keys():
                print('# This is a new page')
                myNextToken = page['NextToken']
                print(page['Parameters'])
            else:
                # Exit if there are no more pages to read
                myNextToken = False
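As an aside, if you just want every parameter under a path and don't need to control the paging yourself, a minimal sketch (reusing the client and paginator set up above, with a hypothetical path) lets Boto3 follow NextToken for you:

def read_all_ssm_parameters(path='/my/parameter/path'):
    # Without a PaginationConfig/StartingToken the iterator follows NextToken internally.
    parameters = []
    for page in paginator.paginate(Path=path, Recursive=True):
        parameters.extend(page['Parameters'])
    return parameters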
Hope this helps!

I will post my solution here and hopefully help other people get their job done faster instead of fiddling around with the amazingly written boto3 API calls.
My use case was to list all the Security Hub ControlIds using the SecurityHub.Client.describe_standards_controls function.
controlsResponse = sh_client.describe_standards_controls(
    StandardsSubscriptionArn=enabledStandardSubscriptionArn)
controls = controlsResponse.get('Controls')

# This is the token for the 101st item in the list.
nextToken = controlsResponse.get('NextToken')

# Call describe_standards_controls with the token set at item 101 to get the next 100 results
controlsResponse1 = sh_client.describe_standards_controls(
    StandardsSubscriptionArn=enabledStandardSubscriptionArn, NextToken=nextToken)
controls1 = controlsResponse1.get('Controls')

# And make the two lists into one
controls.extend(controls1)
Now you have a list of all the SH standards controls for the specified subscribed standard (e.g., the AWS Foundational Standard).
For example, if I want to get all the ControlIds, I can just iterate over the 'controls' list and do
for control in controls:
    controlId = control.get("ControlId")
and the same goes for the other fields in the response, as described here.
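If the subscription has more than two pages of controls, a minimal sketch of the same idea generalized to a loop (assuming the same sh_client and enabledStandardSubscriptionArn as above) keeps calling the API while a NextToken comes back:

controls = []
nextToken = None

while True:
    kwargs = {'StandardsSubscriptionArn': enabledStandardSubscriptionArn}
    if nextToken:
        kwargs['NextToken'] = nextToken
    response = sh_client.describe_standards_controls(**kwargs)
    controls.extend(response.get('Controls', []))
    nextToken = response.get('NextToken')
    if not nextToken:
        break  # no more pages

controlIds = [control.get('ControlId') for control in controls]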

Related

Appending to list replaces last item in Django middleware

I have a middleware I'm using to retain route history within my Django app for breadcrumbs, but for some reason the last item in the list keeps getting replaced rather than the new item being appended to the end of the list.
ROOT_ROUTE_PATH = '/labels/'

class RouteHistoryMiddleware(object):
    request = None
    history = None

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        self.request = request
        if 'history' not in request.session:
            request.session['history'] = []
        self.history = request.session['history']
        request.session['history'].append(request.path)
        if len(self.history) == 0:
            self.request.previous_route = ROOT_ROUTE_PATH
        elif len(self.history) == 1:
            self.request.previous_route = request.session['history'][-1]
        elif len(self.history) > 1:
            self.request.previous_route = request.session['history'][-2]
        return self.get_response(request)
Illustration of how request.session['history'] mutates with the code above:
Open Page A
['/page_a/']
Open Page B
['/page_a/', '/page_b/']
Open Page C
['/page_a/', '/page_c/']
Instead of appending the path to the session directly, try appending to self.history and then overwriting the session's history with the new list:
...
self.history = request.session['history']
self.history.append(request.path)
request.session['history'] = self.history
...
You may need to change your if/else conditions after that.
The issue you're running into is that Django doesn't know you've modified the list, and so the data is getting written to the session inconsistently. From the documentation:
By default, Django only saves to the session database when the session has been modified – that is if any of its dictionary values have been assigned or deleted.
i.e., if you modify the list without reassigning or deleting it, Django will not know the session has been modified. Again, from the documentation:
we can tell the session object explicitly that it has been modified by setting the modified attribute on the session object.
So your original code should work if you add this line after modifying the list in place:
request.session.modified = True
(Replacing the list entirely, as suggested in the other answer, also works - I'm just trying to explain why your original code didn't work).
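Putting that together, here is the original __call__ with the single fix applied (the request.session.modified line is the only addition):

    def __call__(self, request):
        self.request = request
        if 'history' not in request.session:
            request.session['history'] = []
        self.history = request.session['history']
        request.session['history'].append(request.path)
        # Tell Django the session data changed, since the list was mutated in place.
        request.session.modified = True
        if len(self.history) == 0:
            self.request.previous_route = ROOT_ROUTE_PATH
        elif len(self.history) == 1:
            self.request.previous_route = request.session['history'][-1]
        elif len(self.history) > 1:
            self.request.previous_route = request.session['history'][-2]
        return self.get_response(request)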

Refreshing boto3 session when paginating though cloudtrail

I'm writing a script in Python using boto3 to report on the API calls made over the past few months. I have the script pretty much done, but we have a maximum session length of 1 hour, and this will always take longer than that, so the session expires and the script dies.
I have tried to refresh the session periodically to stop it from expiring, but I can't seem to make it work. I'm really hoping that someone has done this before and can tell me what I'm doing wrong.
Below is a cut-down version of the code.
import boto3
import datetime
import time
from botocore.exceptions import ClientError

session_start_time = datetime.datetime.now()
start_date = datetime.datetime.now()
start_date -= datetime.timedelta(days=1)
end_date = datetime.datetime.now()
role = 'arn:aws:iam::1234:role/role'

def role_arn_to_session(**args):
    client = boto3.client('sts')
    response = client.assume_role(**args)
    return boto3.Session(
        aws_access_key_id=response['Credentials']['AccessKeyId'],
        aws_secret_access_key=response['Credentials']['SecretAccessKey'],
        aws_session_token=response['Credentials']['SessionToken'])

session = role_arn_to_session(RoleArn=role, RoleSessionName='session')
cloudtrail = session.client('cloudtrail', region_name='us-east-1')
paginator = cloudtrail.get_paginator("lookup_events")

StartingToken = None
page_iterator = paginator.paginate(
    PaginationConfig={'PageSize': 1000, 'StartingToken': StartingToken},
    StartTime=start_date,
    EndTime=end_date)

for page in page_iterator:
    for ct in page['Events']:
        print(ct)
    try:
        token_file = open("token", "w")
        token_file.write(page["NextToken"])
        StartingToken = page["NextToken"]
    except KeyError:
        break
    if (datetime.datetime.now() - session_start_time).seconds / 60 > 10:
        page_iterator = None
        paginator = None
        cloudtrail = None
        session = None
        session = role_arn_to_session(RoleArn=role, RoleSessionName='session')
        cloudtrail = session.client('cloudtrail', region_name='us-east-1')
        paginator = cloudtrail.get_paginator("lookup_events")
        page_iterator = paginator.paginate(
            PaginationConfig={'PageSize': 1000, 'StartingToken': StartingToken},
            StartTime=start_date,
            EndTime=end_date)
        session_start_time = datetime.datetime.now()
I'd appreciate any help with this.
Thanks in advance
Your solution does not work because you are just rebinding the page_iterator name; the for loop keeps iterating over the original iterator, so the changes you make never take effect.
You can increase the session length if you are running your script using long-term credentials.
By default, the temporary security credentials created by AssumeRole last for one hour. However, you can use the optional DurationSeconds parameter to specify the duration of your session.
Otherwise, you need to revise the application logic a bit. You can try using a shorter time frame when fetching the events: e.g., instead of using 1 day, use 6 hours and slide the window along until you have fetched all the events you want. This is the better approach in my opinion.
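A minimal sketch of that sliding-window idea (reusing role and role_arn_to_session from the question's code; the 6-hour chunk size is a hypothetical choice, pick one a single session can finish):

import datetime

CHUNK = datetime.timedelta(hours=6)  # hypothetical window size

end_date = datetime.datetime.now()
start_date = end_date - datetime.timedelta(days=1)

window_start = start_date
while window_start < end_date:
    window_end = min(window_start + CHUNK, end_date)

    # Assume the role again for every window so the credentials are always fresh.
    session = role_arn_to_session(RoleArn=role, RoleSessionName='session')
    cloudtrail = session.client('cloudtrail', region_name='us-east-1')
    paginator = cloudtrail.get_paginator('lookup_events')

    for page in paginator.paginate(StartTime=window_start, EndTime=window_end):
        for event in page['Events']:
            print(event)

    window_start = window_end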

How to retrieve all historical public tweets with Twitter Premium Search API in Sandbox version (using next token)

I want to download all historical tweets with certain hashtags and/or keywords for a research project. I got the Premium Twitter API for that. I'm using the amazing TwitterAPI to take care of auth and so on.
My problem now is that I'm not an expert developer, and I have some trouble understanding how the next token works and how to get all the tweets into a CSV.
What I want to achieve is to have all the tweets in one single CSV, without having to manually change the fromDate and toDate values. Right now I don't know how to get the next token and how to use it to chain requests.
Here is what I have so far:
from TwitterAPI import TwitterAPI
import csv

SEARCH_TERM = 'my-search-term-here'
PRODUCT = 'fullarchive'
LABEL = 'here-goes-my-dev-env'

api = TwitterAPI("consumer_key",
                 "consumer_secret",
                 "access_token_key",
                 "access_token_secret")

r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL),
                {'query': SEARCH_TERM,
                 'fromDate': '200603220000',
                 'toDate': '201806020000'})

csvFile = open('2006-2018.csv', 'a')
csvWriter = csv.writer(csvFile)

for item in r:
    csvWriter.writerow([item['created_at'], item['user']['screen_name'], item['text'] if 'text' in item else item])
I would be really thankful for any help!
Cheers!
First of all, TwitterAPI includes a helper class that will take care of this for you. TwitterPager works with many types of Twitter endpoints, not just Premium Search. Here is an example to get you started: https://github.com/geduldig/TwitterAPI/blob/master/examples/page_tweets.py
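For instance, a minimal sketch using TwitterPager with the same request parameters as the code above (assuming the api object and csvWriter from the question; get_iterator() is expected to follow the next token for you):

from TwitterAPI import TwitterPager

pager = TwitterPager(api,
                     'tweets/search/%s/:%s' % (PRODUCT, LABEL),
                     {'query': SEARCH_TERM,
                      'fromDate': '200603220000',
                      'toDate': '201806020000'})

# The pager keeps requesting pages and yields one tweet at a time.
for item in pager.get_iterator():
    csvWriter.writerow([item['created_at'], item['user']['screen_name'], item['text'] if 'text' in item else item])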
But to answer your question, the strategy you should take is to put the request you currently have inside a while loop. Then:
1. Each request will return a next field, which you can get with r.json()['next'].
2. When you are done processing the current batch of tweets and are ready for your next request, include the next parameter set to that value.
3. Eventually a request will not include a next field in the returned JSON. At that point, break out of the while loop.
Something like the following.
next = ''
while True:
    r = api.request('tweets/search/%s/:%s' % (PRODUCT, LABEL),
                    {'query': SEARCH_TERM,
                     'fromDate': '200603220000',
                     'toDate': '201806020000',
                     'next': next})
    if r.status_code != 200:
        break
    for item in r:
        csvWriter.writerow([item['created_at'], item['user']['screen_name'], item['text'] if 'text' in item else item])
    json = r.json()
    if 'next' not in json:
        break
    next = json['next']

Twitter API, Searching with dollar signs

This code opens a Twitter stream listener, and the search terms are in the variable upgrades_str. Some searches work, and some don't. I added AMZN to the upgrades list just to be sure there's a frequently used term, since this uses an open Twitter stream rather than searching existing tweets.
Below, I think we only need to review numbers 2 and 4.
I'm using Python 3.5.2 :: Anaconda 4.0.0 (64-bit) on Windows 10.
Variable searches
Searching with: upgrades_str: ['AMZN', 'SWK', 'AIQUY', 'SFUN', 'DOOR'] = returns tweets such as 'i'm tired of people'
Searching with: upgrades_str: ['$AMZN', '$SWK', '$AIQUY', '$SFUN', '$DOOR'] = returns tweets such as 'Chicago to south Florida. Hiphop lives'. This search is the one I wish worked.
Explicit searches
Searching by replacing the variable 'upgrades_str' with the explicit string: ['AMZN', 'SWK', 'AIQUY', 'SFUN', 'DOOR'] = returns 'After being walked in on twice, I have finally figured out how to lock the door here in Sweden'. This one at least has the search term 'door'.
Searching by replacing the variable 'upgrades_str' with the explicit string: ['$AMZN', '$SWK', '$AIQUY', '$SFUN', '$DOOR'] = returns '$AMZN $WFM $KR $REG $KIM: Amazon’s Whole Foods buy puts shopping centers at risk as real'. So the explicit call works, but not the identical variable.
Explicitly searching for ['$AMZN'] = returns a good tweet: 'FANG setting up really good for next week! Added $googl jun23 970c avg at 4.36. $FB $AMZN'.
Explicitly searching for ['cool'] returns 'I can’t believe I got such a cool Pillow!'
import tweepy
import dataset
from textblob import TextBlob
from sqlalchemy.exc import ProgrammingError
import json

db = dataset.connect('sqlite:///tweets.db')

class StreamListener(tweepy.StreamListener):

    def on_status(self, status):
        if status.retweeted:
            return
        description = status.user.description
        loc = status.user.location
        text = status.text
        coords = status.coordinates
        geo = status.geo
        name = status.user.screen_name
        user_created = status.user.created_at
        followers = status.user.followers_count
        id_str = status.id_str
        created = status.created_at
        retweets = status.retweet_count
        bg_color = status.user.profile_background_color
        blob = TextBlob(text)
        sent = blob.sentiment
        if geo is not None:
            geo = json.dumps(geo)
        if coords is not None:
            coords = json.dumps(coords)
        table = db['tweets']
        try:
            table.insert(dict(
                user_description=description,
                user_location=loc,
                coordinates=coords,
                text=text,
                geo=geo,
                user_name=name,
                user_created=user_created,
                user_followers=followers,
                id_str=id_str,
                created=created,
                retweet_count=retweets,
                user_bg_color=bg_color,
                polarity=sent.polarity,
                subjectivity=sent.subjectivity,
            ))
        except ProgrammingError as err:
            print(err)

    def on_error(self, status_code):
        if status_code == 420:
            return False

access_token = 'token'
access_token_secret = 'tokensecret'
consumer_key = 'consumerkey'
consumer_secret = 'consumersecret'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

stream_listener = StreamListener()
stream = tweepy.Stream(auth=api.auth, listener=stream_listener)
stream.filter(track=upgrades_str, languages=['en'])
Here's the answer, in case someone has the problem in the future: "Note that punctuation is not considered to be part of a #hashtag or #mention, so a track term containing punctuation will not match either #hashtags or #mentions." From: https://dev.twitter.com/streaming/overview/request-parameters#track
And for multiple terms, the string, which was converted from a list, needs to be changed to the form ['term1,term2']. Just strip out the apostrophes, spaces, and brackets, then rewrap it:
import re

upgrades_str = re.sub('[\' \[\]]', '', upgrades_str)
upgrades_str = '[\'' + format(upgrades_str) + '\']'

Twitter API: How to get users ID, who favorite specific tweet?

I'm trying to get info about the users who added a specific tweet to their favorites, but I can't find it in the documentation.
It is unfair that Twitter can do this itself but doesn't expose it as an API method.
Apparently, the only way to do this is to scrape Twitter's website:
import urllib2
from lxml.html import parse

# returns list(retweet users), list(favorite users) for a given screen_name and status_id
def get_twitter_user_rts_and_favs(screen_name, status_id):
    url = urllib2.urlopen('https://twitter.com/' + screen_name + '/status/' + status_id)
    root = parse(url).getroot()
    num_rts = 0
    num_favs = 0
    rt_users = []
    fav_users = []
    for ul in root.find_class('stats'):
        for li in ul.cssselect('li'):
            cls_name = li.attrib['class']
            if cls_name.find('retweet') >= 0:
                num_rts = int(li.cssselect('a')[0].attrib['data-tweet-stat-count'])
            elif cls_name.find('favorit') >= 0:
                num_favs = int(li.cssselect('a')[0].attrib['data-tweet-stat-count'])
            elif cls_name.find('avatar') >= 0 or cls_name.find('face-pile') >= 0:  # else face-plant
                for users in li.cssselect('a'):
                    # apparently, favs are listed before retweets, but the retweet summary's listed before the fav summary
                    # if in doubt you can take the difference of returned uids here with retweet uids from the official api
                    if num_favs > 0:  # num_rt > 0:
                        # num_rts -= 1
                        num_favs -= 1
                        # rt_users.append(users.attrib['data-user-id'])
                        fav_users.append(users.attrib['data-user-id'])
                    else:
                        # fav_users.append(users.attrib['data-user-id'])
                        rt_users.append(users.attrib['data-user-id'])
    return rt_users, fav_users

# example
if __name__ == '__main__':
    print get_twitter_user_rts_and_favs('alien_merchant', '674104400013578240')
Short answer: You can't do this perfectly.
Long answer: You can do this with some effort, but it isn't going to be even close to perfect. You can use the Twitter API to monitor the activity of up to 4,000 user IDs. If a tweet is created by one of the 4k people you monitor, then you can get all the information, including the people who have favourited the tweet. This also requires that you push all the information about the people you monitor into a database (I use MongoDB). You can then query the database for information about your tweet.
Twitter API v2 has new likes functionality:
https://twittercommunity.com/t/announcing-twitter-api-v2-likes-lookup-and-blocks-lookup/154353
To get users who have liked a Tweet, use the GET /2/tweets/:id/liking_users endpoint.
They've also provided example code in their GitHub repo.
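A minimal sketch of calling that endpoint with the requests library (assuming you have a v2 bearer token; the tweet ID below is just a placeholder):

import requests

BEARER_TOKEN = 'your-bearer-token'   # assumed: an app-only bearer token for API v2
TWEET_ID = '674104400013578240'      # the tweet whose likers you want

url = 'https://api.twitter.com/2/tweets/{}/liking_users'.format(TWEET_ID)
headers = {'Authorization': 'Bearer {}'.format(BEARER_TOKEN)}

response = requests.get(url, headers=headers)
response.raise_for_status()

# The default user object fields are id, name, and username.
for user in response.json().get('data', []):
    print(user['id'], user['username'])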
Use the endpoint favorites/list with max_id set to the tweet you're looking for.
https://dev.twitter.com/rest/reference/get/favorites/list
