I'm writing a script in Python using boto3 to report on the API calls made over the past few months. I have the script pretty much done, but we have a max session length of 1 hour, the report always takes longer than that, and so the session expires and the script dies.
I have tried to refresh the session periodically to stop it from expiring, but I can't seem to make it work. I'm really hoping that someone has done this before and can tell me what I'm doing wrong.
Below is a cut down version of the code.
import boto3
import datetime
import time
from botocore.exceptions import ClientError

session_start_time = datetime.datetime.now()
start_date = datetime.datetime.now()
start_date -= datetime.timedelta(days=1)
end_date = datetime.datetime.now()

role = 'arn:aws:iam::1234:role/role'

def role_arn_to_session(**args):
    client = boto3.client('sts')
    response = client.assume_role(**args)
    return boto3.Session(
        aws_access_key_id=response['Credentials']['AccessKeyId'],
        aws_secret_access_key=response['Credentials']['SecretAccessKey'],
        aws_session_token=response['Credentials']['SessionToken'])

session = role_arn_to_session(RoleArn=role, RoleSessionName='session')
cloudtrail = session.client('cloudtrail', region_name='us-east-1')
paginator = cloudtrail.get_paginator("lookup_events")

StartingToken = None
page_iterator = paginator.paginate(
    PaginationConfig={'PageSize': 1000, 'StartingToken': StartingToken},
    StartTime=start_date,
    EndTime=end_date)

for page in page_iterator:
    for ct in page['Events']:
        print(ct)
    try:
        token_file = open("token", "w")
        token_file.write(page["NextToken"])
        StartingToken = page["NextToken"]
    except KeyError:
        break
    if (datetime.datetime.now() - session_start_time).seconds / 60 > 10:
        page_iterator = None
        paginator = None
        cloudtrail = None
        session = None
        session = role_arn_to_session(RoleArn=role, RoleSessionName='session')
        cloudtrail = session.client('cloudtrail', region_name='us-east-1')
        paginator = cloudtrail.get_paginator("lookup_events")
        page_iterator = paginator.paginate(
            PaginationConfig={'PageSize': 1000, 'StartingToken': StartingToken},
            StartTime=start_date,
            EndTime=end_date)
        session_start_time = datetime.datetime.now()
I'd appreciate any help with this.
Thanks in advance
Your solution does not work because you are just shadowing the page_iterator variable: the for loop is still consuming the old iterator object, so reassigning the name inside the loop has no effect.
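Not tested, but one way to restructure the loop so that the fresh iterator is actually consumed looks roughly like this (a sketch reusing role_arn_to_session, role, start_date and end_date from the question):
def new_page_iterator(starting_token):
    # Re-assume the role and build a fresh paginator that resumes from the saved token
    session = role_arn_to_session(RoleArn=role, RoleSessionName='session')
    cloudtrail = session.client('cloudtrail', region_name='us-east-1')
    paginator = cloudtrail.get_paginator("lookup_events")
    return paginator.paginate(
        PaginationConfig={'StartingToken': starting_token},
        StartTime=start_date,
        EndTime=end_date)

StartingToken = None
session_start_time = datetime.datetime.now()
page_iterator = new_page_iterator(StartingToken)

finished = False
while not finished:
    finished = True
    for page in page_iterator:
        for ct in page['Events']:
            print(ct)
        StartingToken = page.get('NextToken')
        if StartingToken is None:
            break  # no more pages at all
        if (datetime.datetime.now() - session_start_time).seconds / 60 > 10:
            # Build a fresh iterator and leave the soon-to-be-stale one behind
            page_iterator = new_page_iterator(StartingToken)
            session_start_time = datetime.datetime.now()
            finished = False
            break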
You can increase the session length if you are running your script using long-term credentials.
By default, the temporary security credentials created by AssumeRole last for one hour. However, you can use the optional DurationSeconds parameter to specify the duration of your session.
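For example (sketch only; the role's own "maximum session duration" setting in IAM must allow the longer value, otherwise AssumeRole rejects it):
def role_arn_to_session(**args):
    client = boto3.client('sts')
    # Request a 4-hour session instead of the default 1 hour
    response = client.assume_role(DurationSeconds=14400, **args)
    return boto3.Session(
        aws_access_key_id=response['Credentials']['AccessKeyId'],
        aws_secret_access_key=response['Credentials']['SecretAccessKey'],
        aws_session_token=response['Credentials']['SessionToken'])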
Otherwise, you need to revise the application logic a bit. You can try using a shorter time frame when fetching the trails, e.g. instead of using 1 day, try using 6 hours and slide the time frame accordingly until you fetch all trails you want. This is a better approach in my opinion.
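A rough sketch of that sliding-window idea, reusing the helper from the question (the 6-hour window size is arbitrary):
window = datetime.timedelta(hours=6)
window_start = start_date

while window_start < end_date:
    window_end = min(window_start + window, end_date)
    # Fresh credentials per window keep each chunk well inside the session limit
    session = role_arn_to_session(RoleArn=role, RoleSessionName='session')
    cloudtrail = session.client('cloudtrail', region_name='us-east-1')
    paginator = cloudtrail.get_paginator("lookup_events")
    for page in paginator.paginate(StartTime=window_start, EndTime=window_end):
        for ct in page['Events']:
            print(ct)
    window_start = window_end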
Related
DataStax driver for Cassandra Version 3.25.0,
Python version 3.9
Session.execute() fetches the first 100 records. As per the documentation, the driver is supposed to transparently fetch the next pages as we reach the end of the first page. However, it fetches the same page again and again, and hence the first 100 records are all that is ever accessible.
The for loop that prints records goes infinite.
ssl_context.verify_mode = CERT_NONE
cluster = Cluster(contact_points=[db_host], port=db_port,
                  auth_provider=PlainTextAuthProvider(db_user, db_pwd),
                  ssl_context=ssl_context)
session = cluster.connect()
query = "SELECT * FROM content_usage"
statement = SimpleStatement(query, fetch_size=100)
results = session.execute(statement)
for row in results:
    print(f"{row}")
I could see other similar threads, but they are not answered too. Has anyone encountered this issue before? Any help is appreciated.
I'm a bit confused by the initial statement of the problem. You mentioned that the initial page of results is fetched repeatedly and that these are the only results available to your program. You also indicated that the for loop responsible for printing results turns into an infinite loop when you run the program. These statements seem contradictory to me; how can you know what the driver has fetched if you never get any output? I'm assuming that's what you meant by "goes infinite"... if I'm wrong please correct me.
The following code seems to run as expected against Cassandra 4.0.0 using cassandra-driver 3.25.0 on Python 3.9.0:
import argparse
import logging
import time

from cassandra.cluster import Cluster, SimpleStatement

def setupLogging():
    log = logging.getLogger()
    log.setLevel('DEBUG')
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
    log.addHandler(handler)

def setupSchema(session):
    session.execute("""create keyspace if not exists "baz" with replication = {'class':'SimpleStrategy', 'replication_factor':1};""")
    session.execute("""create table if not exists baz.qux (run_ts bigint, idx int, uuid timeuuid, primary key (run_ts,idx))""")
    session.execute("""truncate baz.qux""")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('-d', '--debug', action='store_true')
    args = parser.parse_args()

    cluster = Cluster()
    session = cluster.connect()

    if args.debug:
        setupLogging()

    setupSchema(session)

    run_ts = int(time.time())
    insert_stmt = session.prepare("""insert into baz.qux (run_ts,idx,uuid) values (?,?,now())""")
    for idx in range(10000):
        session.execute(insert_stmt, [run_ts, idx])

    query = "select * from baz.qux"
    stmt = SimpleStatement(query, fetch_size=100)
    results = session.execute(stmt)
    for row in results:
        print(f"{row}")

    cluster.shutdown()
$ time (python foo.py | wc -l)
10000
real 0m12.452s
user 0m3.786s
sys 0m2.197s
You might try running your sample app with debug logging enabled (see sample code above for how to enable this). It sounds like something might be off in your Cassandra configuration (or perhaps your client setup); the additional logging might help you identify what (if anything) is getting in the way.
The logic in your code only calls execute() once, so the contents of results will only ever be the same list of 100 rows.
You need to call execute() in your loop to get the next page of results, like this:
query = "SELECT * FROM content_usage"
statement = SimpleStatement(query, fetch_size=100)
for row in session.execute(statement):
    process_row(row)
For more info, see Paging with the Python driver. Cheers!
Below is the code snippet that finally worked for me, after restricting the driver version to 3.20:
statement = session.prepare(query)

# Execute the query once and retrieve the first page of results
results = session.execute(statement, params)
for row in results.current_rows:
    process_row(row)

# Fetch more pages until they are exhausted
while results.has_more_pages:
    page_state = results.paging_state
    results = session.execute(statement, parameters=params, paging_state=page_state)
    for row in results.current_rows:
        process_row(row)
I'm trying to understand how to delete keys out of a memoized database call using the Python DiskCache package. Below is a simple example that shows how I am memoizing a simple function call, and it works fine, with subsequent calls running much faster.
The documentation says that I can delete specific keys, but I can't see what the key is when it was generated by the memoize decorator.
I'd have guessed that it was something like
cache.pop(("__main__slowfunc", 5)), and while this doesn't throw an error, it doesn't remove the key from the cache.
from diskcache import FanoutCache
from pathlib import Path
import os
import time

local = Path(os.environ["AllUsersProfile"]) / "CacheTests"
cacheLocation = local / "cache"
cache = FanoutCache(cacheLocation, timeout=1)

@cache.memoize()
def slowfunc(iterations):
    for i in range(0, iterations):
        time.sleep(1)

iterations = 6

start = time.time()
slowfunc(iterations)
end = time.time()
print(f"Initial Call = {round(end-start,0)}s")
Any assistance appreciated.
Thanks
Great question, and it's really confusing that your example doesn't work, given that it matches what the source code of diskcache appears to do.
I've extended the example a bit and it seems to work for me like this. See if this works for you:
from diskcache import FanoutCache
from pathlib import Path
import os
import time

local = Path(os.environ["AllUsersProfile"]) / "CacheTests"
cacheLocation = local / "cache"
cache = FanoutCache(cacheLocation, timeout=1)

@cache.memoize()
def slowfunc(iterations):
    print("Recalculating")
    for i in range(0, iterations):
        time.sleep(1)

iterations = 3

cache.delete(("__main__slowfunc", iterations))

start = time.time()
slowfunc(iterations)
end = time.time()
print(f"Initial Call = {round(end-start,0)}s")

start = time.time()
slowfunc(iterations)
end = time.time()
print(f"Subsequent Call = {round(end-start,0)}s")

cache.delete(("__main__slowfunc", iterations))

start = time.time()
slowfunc(iterations)
end = time.time()
print(f"After deletion = {round(end-start,0)}s")
Results:
Recalculating
Initial Call = 3.0s
Subsequent Call = 0.0s
Recalculating
After deletion = 3.0s
I tried it with pop instead of delete and that worked too.
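If working out the key tuple by hand is fiddly, diskcache also attaches a __cache_key__ helper to the memoized function (at least in recent versions, if I'm reading the docs right), so the exact key can be computed instead of guessed:
# Ask the memoized wrapper for the exact key it uses for these arguments
key = slowfunc.__cache_key__(iterations)
cache.delete(key)   # cache.pop(key) works here too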
I have the following question: I want to set up a routine that iterates over a pandas dataframe and extracts longitude and latitude data for each address, using the 'geopy' library.
The routine I created was:
import os
import time

import pandas as pd
from geopy.geocoders import GoogleV3

arquivo = pd.ExcelFile('path')
df = arquivo.parse("Table1")

def set_proxy():
    proxy_addr = 'http://{user}:{passwd}@{address}:{port}'.format(
        user='usuario', passwd='senha',
        address='IP', port=int('PORTA'))
    os.environ['http_proxy'] = proxy_addr
    os.environ['https_proxy'] = proxy_addr

def unset_proxy():
    os.environ.pop('http_proxy')
    os.environ.pop('https_proxy')

set_proxy()

geo_keys = ['AIzaSyBXkATWIrQyNX6T-VRa2gRmC9dJRoqzss0']  # API Google
geolocator = GoogleV3(api_key=geo_keys)

for index, row in df.iterrows():
    location = geolocator.geocode(row['NO_LOGRADOURO'], timeout=10)
    time.sleep(2)
    lat = location.latitude
    lon = location.longitude
    address = location.address
    unset_proxy()
    print(str(lat) + ', ' + str(lon))
The problem I'm having is that when I run the code the following error is thrown:
GeocoderQueryError: Your request was denied.
I tried creating the geolocator without passing the key to the Google API; however, I get the following message:
KeyError: 'http_proxy'
and if I remove the unset_proxy() call from inside the for loop, the message I receive is:
GeocoderQuotaExceeded: The given key has gone over the requests limit in the 24 hour period or has submitted too many requests in too short a period of time.
But I only made 5 requests today, and I'm putting a 2-second sleep between requests. Should the period be longer?
Any idea?
The api_key argument of the GoogleV3 class must be a string, not a list of strings (that's the cause of your first issue).
geopy doesn't guarantee that the http_proxy/https_proxy env vars are respected (especially runtime modifications of os.environ). The usage of proxies advised by the docs is:
geolocator = GoogleV3(proxies={'http': proxy_addr, 'https': proxy_addr})
PS: Please don't ever post your API keys publicly. I suggest revoking the key you've posted in the question and generating a new one, to prevent it from being abused by someone else.
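Putting both points together, the construction might look roughly like this (a sketch; the key and proxy address are placeholders, not working values):
from geopy.geocoders import GoogleV3

proxy_addr = 'http://usuario:senha@IP:PORTA'
geolocator = GoogleV3(
    api_key='YOUR_NEW_API_KEY',                         # a plain string, not a list
    proxies={'http': proxy_addr, 'https': proxy_addr})  # instead of os.environ tricks

location = geolocator.geocode('some address', timeout=10)
print(location.latitude, location.longitude)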
I was trying to download a .csv file from this url for the history of a stock. Here's my code:
import requests
r = requests.get("https://query1.finance.yahoo.com/v7/finance/download/CHOLAFIN.BO?period1=1514562437&period2=1517240837&interval=1d&events=history&crumb=JaCfCutLNr7")
file = open(r"history_of_stock.csv", 'w')
file.write(r.text)
file.close()
But when I opened the file history_of_stock.csv, this was what I found:
{
  "finance": {
    "error": {
      "code": "Unauthorized",
      "description": "Invalid cookie"
    }
  }
}
I couldn't find anything that could fix my problem. I found this thread in which someone has the same problem except that it is in C#: C# Download price data csv file from https instead of http
To complement the earlier answer and provide concrete, complete code, I wrote a script which accomplishes the task of getting historical stock prices from Yahoo Finance. I tried to write it as simply as possible. To give a summary: when you use requests to get a URL, in many instances you don't need to worry about crumbs or cookies. However, with Yahoo Finance, you need to get the crumbs and the cookies. Once you have the cookies, you are good to go! Make sure to set a timeout on the requests.get call.
import re
import requests
import sys
from pdb import set_trace as pb
symbol = sys.argv[-1]
start_date = '1442203200' # start date timestamp
end_date = '1531800000' # end date timestamp
crumble_link = 'https://finance.yahoo.com/quote/{0}/history?p={0}'
crumble_regex = r'CrumbStore":{"crumb":"(.*?)"}'
cookie_regex = r'set-cookie: (.*?);'
quote_link = 'https://query1.finance.yahoo.com/v7/finance/download/{}?period1={}&period2={}&interval=1d&events=history&crumb={}'
link = crumble_link.format(symbol)
session = requests.Session()
response = session.get(link)
# get crumbs
text = str(response.content)
match = re.search(crumble_regex, text)
crumbs = match.group(1)
# get cookie
cookie = session.cookies.get_dict()
url = "https://query1.finance.yahoo.com/v7/finance/download/%s?period1=%s&period2=%s&interval=1d&events=history&crumb=%s" % (symbol, start_date, end_date, crumbs)
r = requests.get(url,cookies=session.cookies.get_dict(),timeout=5, stream=True)
out = r.text
filename = '{}.csv'.format(symbol)
with open(filename, 'w') as f:
    f.write(out)
There was a service for exactly this but it was discontinued.
Now you can do what you intend but first you need to get a Cookie. On this post there is an example of how to do it.
Basically, first you need to make a useless request to get the Cookie and later, with this Cookie in place, you can query whatever else you actually need.
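In outline, that two-step approach with a requests.Session might look something like this (a sketch only; depending on the endpoint you may still need to extract the crumb and append it to the download URL, as the longer script above does):
import requests

symbol = 'CHOLAFIN.BO'
with requests.Session() as s:
    # First request exists only so Yahoo sets its cookies on the session
    s.get('https://finance.yahoo.com/quote/{0}/history?p={0}'.format(symbol), timeout=10)
    # The same session then reuses those cookies for the actual download
    r = s.get('https://query1.finance.yahoo.com/v7/finance/download/' + symbol,
              params={'period1': 1514562437, 'period2': 1517240837,
                      'interval': '1d', 'events': 'history'},
              timeout=10)
    with open('history_of_stock.csv', 'w') as f:
        f.write(r.text)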
There's also a post about another service which might make your life easier.
There's also a Python module to work around this inconvenience and code to show how to do it without it.
BACKGROUND:
The AWS operation to list IAM users returns a max of 50 by default.
Reading the docs (links below), I ran the following code and returned a complete data set by setting "MaxItems" to 1000.
paginator = client.get_paginator('list_users')
response_iterator = paginator.paginate(
    PaginationConfig={
        'MaxItems': 1000,
        'PageSize': 123})

for page in response_iterator:
    u = page['Users']
    for user in u:
        print(user['UserName'])
http://boto3.readthedocs.io/en/latest/guide/paginators.html
https://boto3.readthedocs.io/en/latest/reference/services/iam.html#IAM.Paginator.ListUsers
QUESTION:
If the "MaxItems" was set to 10, for example, what would be the best method to loop through the results?
I tested with the following but it only loops 2 iterations before 'IsTruncated' == False and results in "KeyError: 'Marker'". Not sure why this is happening because I know there are over 200 results.
marker = None
while True:
    paginator = client.get_paginator('list_users')
    response_iterator = paginator.paginate(
        PaginationConfig={
            'MaxItems': 10,
            'StartingToken': marker})
    # print(response_iterator)
    for page in response_iterator:
        u = page['Users']
        for user in u:
            print(user['UserName'])
        print(page['IsTruncated'])
        marker = page['Marker']
        print(marker)
    else:
        break
(Answer rewrite)
NOTE: the paginator contains a bug that doesn't tally with the documentation (or vice versa). MaxItems doesn't return the Marker or NextToken when the total number of items exceeds MaxItems. It is actually PageSize that controls whether the Marker/NextToken indicator is returned.
import sys
import boto3

iam = boto3.client("iam")
marker = None
while True:
    paginator = iam.get_paginator('list_users')
    response_iterator = paginator.paginate(
        PaginationConfig={
            'PageSize': 10,
            'StartingToken': marker})
    for page in response_iterator:
        print("Next Page : {} ".format(page['IsTruncated']))
        u = page['Users']
        for user in u:
            print(user['UserName'])
    try:
        marker = response_iterator['Marker']
        print(marker)
    except KeyError:
        sys.exit()
It is not your mistake that your code doesn't work. MaxItems in the paginator seems to act as a "threshold" indicator. Ironically, MaxItems inside the original boto3 iam.list_users call still works as documented.
If you check boto3 iam.list_users, you will notice that you must either omit Marker entirely or pass it a real value. Apparently, the paginator is NOT a wrapper for every boto3 list_* method.
import sys
import boto3

iam = boto3.client("iam")
marker = None
while True:
    if marker:
        response_iterator = iam.list_users(
            MaxItems=10,
            Marker=marker
        )
    else:
        response_iterator = iam.list_users(
            MaxItems=10
        )
    print("Next Page : {} ".format(response_iterator['IsTruncated']))
    for user in response_iterator['Users']:
        print(user['UserName'])
    try:
        marker = response_iterator['Marker']
        print(marker)
    except KeyError:
        sys.exit()
You can follow up on the issue I filed on the boto3 GitHub. According to the maintainer, you can call build_full_result() after paginate(); that will show the desired behavior.
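For reference, that suggestion looks roughly like this (a sketch based on my reading of the issue; build_full_result() merges all pages up to MaxItems, and the resume token, when there is one, comes back under 'NextToken'):
import boto3

iam = boto3.client("iam")
paginator = iam.get_paginator('list_users')

marker = None
while True:
    result = paginator.paginate(
        PaginationConfig={'MaxItems': 10, 'StartingToken': marker}
    ).build_full_result()
    for user in result['Users']:
        print(user['UserName'])
    marker = result.get('NextToken')
    if marker is None:
        break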
This code wasn't working for me. It always drops the remainder of the items on the last page and doesn't include them in the results: it gives me 60 accounts when I know I have 68, because that last result page never gets appended to my list of account UserNames. I have concerns that the examples above are doing the same thing and people aren't noticing it in their results.
Besides, it seems overly complex to paginate through with an arbitrary page size; to what purpose?
This should be simple, and it gives you a complete listing.
import boto3

iam = boto3.client("iam")
paginator = iam.get_paginator('list_users')
response_iterator = paginator.paginate()

accounts = []
for page in response_iterator:
    for user in page['Users']:
        accounts.append(user['UserName'])

>>> len(accounts)
68
This post is pretty old, but due to the lack of concise documentation I want to share my code for all of those who are struggling with this.
Here are two simple examples of how I solved it using Boto3's paginator, hoping this helps you understand how it works.
Boto3 official pagination documentation:
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/paginators.html
AWS API specifying that the first token should be $null (None in Python):
https://docs.aws.amazon.com/powershell/latest/reference/items/Get-SSMParametersByPath.html
Examples:
First example with little complexity for people like me who struggled to understand how this works:
def read_ssm_parameters():
    myNextToken = True  # added: loop sentinel so the while below has something to test
    page_iterator = paginator.paginate(
        Path='path_to_the_parameters',
        Recursive=True,
        PaginationConfig={
            'MaxItems': 10,
            'PageSize': 10,
        }
    )
    while myNextToken:
        for page in page_iterator:
            print('# This is new page')
            print(page['Parameters'])
            if 'NextToken' in page.keys():
                print(page['NextToken'])
                myNextToken = page['NextToken']
            else:
                myNextToken = False
        page_iterator = paginator.paginate(
            Path=baseSSMPath,
            Recursive=True,
            PaginationConfig={
                'MaxItems': 10,
                'PageSize': 10,
                'StartingToken': myNextToken
            }
        )
Second example with reduced code but without the complexity of using recursion
def read_ssm_parameters(myNextToken='None'):
    while myNextToken:
        page_iterator = paginator.paginate(
            Path='path_to_the_parameters',
            Recursive=True,
            PaginationConfig={
                'MaxItems': 10,
                'PageSize': 10,
                'StartingToken': myNextToken
            }
        )
        for page in page_iterator:
            if 'NextToken' in page.keys():
                print('# This is a new page')
                myNextToken = page['NextToken']
                print(page['Parameters'])
            else:
                # Exit if there are no more pages to read
                myNextToken = False
Hope this helps!
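As a footnote: if you don't actually need MaxItems, the paginator will follow NextToken for you, so the whole routine can collapse to something like this (sketch, assuming the standard SSM get_parameters_by_path paginator):
import boto3

ssm = boto3.client('ssm')
paginator = ssm.get_paginator('get_parameters_by_path')

parameters = []
for page in paginator.paginate(Path='path_to_the_parameters', Recursive=True):
    parameters.extend(page['Parameters'])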
I will post my solution here and hopefully help other people do their job faster instead of fiddling around with the amazingly written boto3 api calls.
My use case was to list all the Security Hub ControlIds using the SecurityHub.Client.describe_standards_controls function.
controlsResponse = sh_client.describe_standards_controls(
    StandardsSubscriptionArn=enabledStandardSubscriptionArn)
controls = controlsResponse.get('Controls')

# This is the token for the 101st item in the list.
nextToken = controlsResponse.get('NextToken')

# Call describe_standards_controls with the token set at item 101 to get the next 100 results
controlsResponse1 = sh_client.describe_standards_controls(
    StandardsSubscriptionArn=enabledStandardSubscriptionArn, NextToken=nextToken)
controls1 = controlsResponse1.get('Controls')

# And make the two lists into one
controls.extend(controls1)
Now you have a list of all the SH standards controls for the specified subscribed standard (e.g., the AWS Foundational Standard).
For example, if I want to get all the ControlIds, I can just iterate over the 'controls' list and do
controlId = control.get("ControlId")
and the same goes for any other field in the response, as described here.
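If there can be more than two pages, the same idea generalizes to a loop that keeps following NextToken until it disappears (a sketch reusing sh_client and enabledStandardSubscriptionArn from above):
controls = []
nextToken = None
while True:
    kwargs = {'StandardsSubscriptionArn': enabledStandardSubscriptionArn}
    if nextToken:
        kwargs['NextToken'] = nextToken
    response = sh_client.describe_standards_controls(**kwargs)
    controls.extend(response.get('Controls', []))
    nextToken = response.get('NextToken')
    if not nextToken:
        break

controlIds = [control.get('ControlId') for control in controls]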