get list of tables in database using boto3 - python-3.x

I'm trying to get a list of the tables from a database in my AWS Data Catalog using boto3. I'm running the code below on AWS, in a SageMaker notebook. It runs forever (over 30 minutes) and doesn't return any results, even though test_db only has 4 tables in it. My goal is to run similar code as part of an AWS Glue ETL job script. Does anyone see what the issue might be, or can suggest how to do this?
code:
import boto3
from pprint import pprint

glue = boto3.client('glue', region_name='us-east-2')

response = glue.get_tables(
    DatabaseName='test_db'
)
pprint(response['TableList'])

db = boto3.resource('dynamodb', region_name="us-east-2")  # 'session' was undefined; use boto3 directly
tables = list(db.tables.all())
print(tables)
Resource: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/dynamodb.html
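For what it's worth, a minimal sketch of listing the Glue catalog tables with boto3's built-in paginator (same region and database name as in the question); if this also hangs, the hold-up may be network access from the notebook to the Glue endpoint rather than the API call itself:
import boto3

glue = boto3.client('glue', region_name='us-east-2')

# get_tables returns one page of results at a time; the paginator
# follows NextToken for us and yields every page.
paginator = glue.get_paginator('get_tables')
table_names = []
for page in paginator.paginate(DatabaseName='test_db'):
    table_names.extend(t['Name'] for t in page['TableList'])

print(table_names)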

Related

How to directly read excel file from s3 with pandas in airflow dag?

I am trying to read an Excel file from S3 inside an Airflow DAG with Python, but it does not seem to work. It is very weird because it works when I read it from outside Airflow with pd.read_excel(s3_excel_path).
What I did:
Set AWS credentials in Airflow (this works well, as I can list my S3 bucket)
Install pandas and s3fs in the Docker environment where I run Airflow
Try to read the file with pd.read_excel(s3_excel_path)
As I said, it works when I try it outside of Airflow. Moreover, I don't get any error; the DAG just continues to run indefinitely (at the step where it is supposed to read the file) and nothing happens, even if I wait 20 minutes.
(I would like to avoid downloading the file from S3, processing it and then uploading it back to S3, which is why I am trying to read it directly from S3.)
Note: it does not work with CSV either.
EDIT: Likewise, I can't save my dataframe directly to S3 with df.to_csv('s3_path') in the Airflow DAG, while I can do it in plain Python.
To read data files stored in S3 with pandas, you have two options: download them using boto3 (or the AWS CLI) and read the local files, which is the solution you are not looking for, or use the s3fs API supported by pandas:
import os

import pandas as pd

AWS_S3_BUCKET = os.getenv("AWS_S3_BUCKET")
AWS_ACCESS_KEY_ID = os.getenv("AWS_ACCESS_KEY_ID")
AWS_SECRET_ACCESS_KEY = os.getenv("AWS_SECRET_ACCESS_KEY")
AWS_SESSION_TOKEN = os.getenv("AWS_SESSION_TOKEN")

key = "path/to/excel/file"

books_df = pd.read_excel(
    f"s3://{AWS_S3_BUCKET}/{key}",
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)
To use this solution, you need to install s3fs and apache-airflow-providers-amazon:
pip install s3fs
pip install apache-airflow-providers-amazon
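The EDIT about df.to_csv('s3_path') failing has the same cause; with s3fs installed, writing back to S3 goes through the same storage_options mechanism. A small sketch reusing the variables above (the output key "processed/books.csv" is just a placeholder):
# write the dataframe straight back to S3 via s3fs
books_df.to_csv(
    f"s3://{AWS_S3_BUCKET}/processed/books.csv",
    index=False,
    storage_options={
        "key": AWS_ACCESS_KEY_ID,
        "secret": AWS_SECRET_ACCESS_KEY,
        "token": AWS_SESSION_TOKEN,
    },
)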

How to call Cluster API and start cluster from within Databricks Notebook?

Currently we are using a bunch of notebooks to process our data in Azure Databricks, mainly with Python/PySpark.
We want to make sure that our clusters are started (warmed up) before initiating the data processing, so we are exploring ways to access the Clusters API from within Databricks notebooks.
So far we tried running the following:
import subprocess

cluster_id = "XXXX-XXXXXX-XXXXXXX"
subprocess.run(
    [f'databricks clusters start --cluster-id "{cluster_id}"'], shell=True
)
which, however, returns the following, and nothing really happens afterwards; the cluster is not started:
CompletedProcess(args=['databricks clusters start --cluster-id "0824-153237-ovals313"'], returncode=127)
Is there any convenient and smart way to call the Clusters API from within a Databricks notebook, or maybe call a curl command, and how is this achieved?
Most probably the error comes from the databricks CLI not being available on the PATH (return code 127 means "command not found") or from incorrectly configured credentials.
Instead of using the command-line application, it's better to call the Start command of the Clusters REST API directly. This could be done with something like this:
import requests

# pull the workspace URL and an API token from the notebook context
ctx = dbutils.notebook.entry_point.getDbutils().notebook().getContext()
host_name = ctx.tags().get("browserHostName").get()
host_token = ctx.apiToken().get()

cluster_id = "some_id"  # put your cluster ID here

requests.post(
    f'https://{host_name}/api/2.0/clusters/start',
    json={'cluster_id': cluster_id},
    headers={'Authorization': f'Bearer {host_token}'}
)
and then you can monitor the status using the Get endpoint until it gets into the RUNNING state:
response = requests.get(
    f'https://{host_name}/api/2.0/clusters/get?cluster_id={cluster_id}',
    headers={'Authorization': f'Bearer {host_token}'}
).json()
status = response['state']
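For example, a minimal polling sketch that reuses host_name, host_token and cluster_id from above and waits until the cluster reports RUNNING (the ~10-minute cap and 10-second interval are arbitrary choices):
import time

import requests

for _ in range(60):  # give up after roughly 10 minutes
    state = requests.get(
        f'https://{host_name}/api/2.0/clusters/get?cluster_id={cluster_id}',
        headers={'Authorization': f'Bearer {host_token}'}
    ).json()['state']
    if state == 'RUNNING':
        break
    time.sleep(10)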

Run python Flask API on AWS EC2 through boto3

I'm new to AWS, so I'm writing code to create an instance from an image, and I want that EC2 instance to run a Python script as soon as it is created, like this:
python /folder/folder2/api_flask.py
Here's my boto3 code to create the instance:
import boto3
client = boto3.session('ec2')
client.run_instances(ImageId='ami-id_number_of_img', MinCount=1, MaxCount=1, InstanceType='t2.nano')
Thanks for your help.
run_instances has an option called UserData which allows you to run commands on your Linux instance at launch.
Thus, to run your code, you can try the following:
import boto3

client = boto3.client('ec2')  # not boto3.session('ec2')
client.run_instances(
    ImageId='ami-id_number_of_img',
    MinCount=1,
    MaxCount=1,
    InstanceType='t2.nano',
    UserData='#!/bin/bash\npython /folder/folder2/api_flask.py\n'
)
Since you mention you are new to AWS, consider using CloudFormation to provision AWS infrastructure. You'll still need to leverage UserData, as Marcin mentioned.
MyInstance:
  Type: AWS::EC2::Instance
  Properties:
    UserData:
      Fn::Base64: !Sub |
        python /folder/folder2/api_flask.py
    InstanceType: t2.nano
    ImageId: ami-id_number_of_img
Why CloudFormation? It'd be more readable and allows for in-place updates as well as tear-downs. You could then launch the stack via boto3 (disclaimer: not tested, but it demonstrates the idea):
import boto3

client = boto3.client('cloudformation')

with open('mytemplate.yml', 'r') as f:
    response = client.create_stack(
        StackName='my-stack',
        TemplateBody=f.read())
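If the script needs to block until the instance is actually provisioned, boto3's CloudFormation waiter can be added after create_stack (a small sketch reusing the client and stack name from above):
# block until the stack (and the EC2 instance in it) finishes creating
waiter = client.get_waiter('stack_create_complete')
waiter.wait(StackName='my-stack')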

list_datasets() method does nothing in AWS Lambda

I am trying to get the list of datasets from BigQuery inside an AWS Lambda function, but executing the client.list_datasets() method does nothing and the Lambda times out.
My code is as follows:
from google.cloud.bigquery import Client
from google.oauth2.service_account import Credentials

credentials = Credentials.from_service_account_info(service_account_dict)

client = Client(
    project=service_account_dict.get("project_id"),
    credentials=credentials
)

datasets = client.list_datasets()
print(datasets)

for dataset in datasets:
    print("dataset info", dataset.__dict__)
The output of the first print statement is:
<google.api_core.page_iterator.HTTPIterator object at 0x7fbae4975550>
But the second print of dataset.__dict__ is never reached; looping over the HTTPIterator object never happens.
BTW, the code works perfectly fine on my local machine.
The AWS VPC that I used for the Lambda function was causing this issue: the VPC blocked requests to external APIs (in my case the BigQuery API). Since list_datasets() returns a lazy HTTPIterator, the actual HTTP request is only sent when you iterate over it, which is why the first print works while the loop hangs.
Configuring the VPC subnets and a NAT gateway to give the Lambda function internet access (0.0.0.0/0) solved the issue.
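While debugging networking like this, it can help to make the call fail fast instead of hanging until the Lambda timeout; the BigQuery client methods accept a timeout argument (a sketch, the value is an arbitrary choice):
# raise instead of hanging if the BigQuery API cannot be reached in time
datasets = client.list_datasets(timeout=30)
for dataset in datasets:
    print("dataset info", dataset.__dict__)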

AWS - Neptune restore from snapshot using SDK

I'm trying to test restoring Neptune instances from a snapshot using Python (boto3). Long story short, we want to spin up and delete the dev instance daily using automation.
When restoring, my code seems to only create the cluster without creating the attached instance. I have also tried creating an instance once the cluster is up and adding it to the cluster, but that doesn't work either. (ref: client.create_db_instance)
My code does the following: get the most recent snapshot, then use that to create the cluster so the most recent data is there.
import time

import boto3

client = boto3.client('neptune')

response = client.describe_db_cluster_snapshots(
    DBClusterIdentifier='neptune',
    MaxRecords=100,
    IncludeShared=False,
    IncludePublic=False
)

snaps = response['DBClusterSnapshots']
snaps.sort(key=lambda c: c['SnapshotCreateTime'], reverse=True)
latest_snapshot = snaps[0]
snapshot_ID = latest_snapshot['DBClusterSnapshotIdentifier']
print("Latest snapshot: " + snapshot_ID)

db_response = client.restore_db_cluster_from_snapshot(
    AvailabilityZones=['us-east-1c'],
    DBClusterIdentifier='neptune-test',
    SnapshotIdentifier=snapshot_ID,
    Engine='neptune',
    Port=8182,
    VpcSecurityGroupIds=['sg-randomString'],
    DBSubnetGroupName='default-vpc-groupID'
)

time.sleep(60)

db_instance_response = client.create_db_instance(
    DBName='neptune',
    DBInstanceIdentifier='brillium-neptune',
    DBInstanceClass='db.r4.large',
    Engine='neptune',
    DBSecurityGroups=[
        'sg-string',
    ],
    AvailabilityZone='us-east-1c',
    DBSubnetGroupName='default-vpc-string',
    BackupRetentionPeriod=7,
    Port=8182,
    MultiAZ=False,
    AutoMinorVersionUpgrade=True,
    PubliclyAccessible=False,
    DBClusterIdentifier='neptune-test',
    StorageEncrypted=True
)
The documentation doesn't help much at all. It's very good at listing the parameters needed for basic creation, but not at explaining how to get an actual instance into the cluster. If I attempt to create an instance using the same cluster name, it either errors out or creates a new cluster with the same name appended with '-1'.
If you want to programmatically do a restore from snapshot, then you need to:
Create the cluster snapshot using create-db-cluster-snapshot
Restore cluster from snapshot using restore-db-cluster-from-snapshot
Create an instance in the new cluster using create-db-instance
You mentioned that you did do a create-db-instance call in the end, but your example snippet does not have it. If that call did succeed, then you should see an instance provisioned inside that cluster.
When you do a restore from Snapshot using the Neptune Console, it does steps #2 and #3 for you.
It seems like you did the following:
Create the snapshot via CLI
Create the cluster via CLI
Create an instance in the cluster, via Console
Today, we recommend restoring the snapshot entirely via the Console or entirely using the CLI.
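As a boto3 sketch of steps #2 and #3 (reusing the identifiers from the question), the fixed sleep can be replaced by polling the cluster status until it is available before creating the instance:
import time

import boto3

client = boto3.client('neptune')

# wait for the restored cluster to become available before attaching an instance
while True:
    cluster = client.describe_db_clusters(
        DBClusterIdentifier='neptune-test'
    )['DBClusters'][0]
    if cluster['Status'] == 'available':
        break
    time.sleep(30)

client.create_db_instance(
    DBInstanceIdentifier='brillium-neptune',
    DBInstanceClass='db.r4.large',
    Engine='neptune',
    DBClusterIdentifier='neptune-test'
)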
