How to get a pyodbc connection in an AWS MWAA Airflow DAG? - python-3.x

I added pyodbc=4.0.30 to the requirements.txt for MWAA Airflow and, in my code, built the connection string like this:
dbconnection = pyodbc.connect("Driver={ODBC Driver 17 for SQL Server};Server="+Server+";Database="+Database+";UID="+UserID+";PWD="+Password, autocommit=True)
Now the error is Broken DAG: [/usr/local/airflow/dags/test.py] No module named 'pyodbc'
Version of Airflow: 1.10.12
There is hardly any AWS MWAA documentation on SQL Server or Postgres connections, especially for pyodbc. I ran into this issue earlier with Lambda functions and solved it with Lambda layers, but I'm not sure how MWAA works. Any suggestions are appreciated.
Please don't recommend another technology such as hosting Airflow on EC2; the company is firm on using MWAA.

One workaround is to use pymssql instead of pyodbc, since pymssql does not depend on a system-level ODBC driver that you cannot easily install on MWAA:

import pymssql
import pandas as pd

# Connect to SQL Server without an ODBC driver
conn = pymssql.connect(
    server=server,
    user=username,
    password=password,
    database=database
)

# Read the query result into a pandas DataFrame
query = "select IDpk, name, Remarks from TestTable"
df = pd.read_sql(query, conn)
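For context, a minimal sketch of how this might sit inside an MWAA DAG is below. It is illustrative only: the DAG id, schedule, and connection parameters are placeholder assumptions, pymssql and pandas would have to be pinned in the MWAA requirements.txt (with double-equals pins, e.g. pymssql==2.1.5; the exact versions are assumptions), and the import path for PythonOperator follows Airflow 1.10.x.

from datetime import datetime

import pandas as pd
import pymssql
from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x path


def fetch_testtable():
    # Placeholder credentials; in practice read them from an Airflow Connection or Variable
    conn = pymssql.connect(server="my-server", user="my-user",
                           password="my-password", database="my-db")
    df = pd.read_sql("select IDpk, name, Remarks from TestTable", conn)
    conn.close()
    return len(df)


with DAG(dag_id="mssql_pymssql_example",
         start_date=datetime(2021, 1, 1),
         schedule_interval=None,
         catchup=False) as dag:
    fetch_task = PythonOperator(task_id="fetch_testtable",
                                python_callable=fetch_testtable)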

Related

load pandas dataframe into Redshift

I am trying to load a pandas DataFrame into Redshift, but it keeps giving me an error. I need help correcting the cluster configuration so the load works.
Below is my code and error traceback:
from sqlalchemy import create_engine
import pyodbc
import psycopg2

username = "#####"
host = "redshift-cluster-****.*****.ap-south-1.redshift.amazonaws.com"
driver = "Amazon Redshift (x64)"
port = 5439
pwd = "******"
db = "dev"
table = "tablename"

rs_engine = create_engine(f"postgresql://{username}:{pwd}@{host}:{port}/{db}")
df.to_sql(table, con=rs_engine, if_exists='replace', index=False)
Traceback:
OperationalError: (psycopg2.OperationalError) connection to server at "redshift-cluster-****.****.ap-south-1.redshift.amazonaws.com" (3.109.77.136), port 5439 failed: Connection timed out (0x0000274C/10060)
Is the server running on that host and accepting TCP/IP connections?
I even tried the options below, but I get the same error:

rs_engine = create_engine(f"redshift+psycopg2://{username}@{host}:{port}/{db}")
rs_engine = create_engine(f"postgresql+psycopg2://{username}:{pwd}@{host}:{port}/{db}")

rs_engine = redshift_connector.connect(
    host='redshift-cluster-####.****.ap-south-1.redshift.amazonaws.com',
    database='dev',
    user='****',
    password='#####'
)
I also have the "Publicly accessible" setting enabled on the Redshift cluster, but I am still unable to connect and load the data.
UPDATE:
I also tried using the ODBC driver, but I get the same error:

import pyodbc

cnxn = pyodbc.connect(Driver=driver,
                      Server=host,
                      Database=db,
                      UID=username,
                      PWD=pwd,
                      Port=port)

I got the same error when I tried to set up the connection with the ODBC Data Sources app as well.
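Since the traceback shows a plain TCP timeout at the resolved address, the problem is almost always network-level (the cluster's security group inbound rules, VPC routing, or the client IP) rather than the driver or connection string. A small sketch that separates the two concerns is below; it assumes SQLAlchemy 1.4+ for URL.create, and the host and credentials are placeholders.

import socket

from sqlalchemy import create_engine
from sqlalchemy.engine import URL

host = "redshift-cluster-xxxx.xxxx.ap-south-1.redshift.amazonaws.com"  # placeholder
port = 5439

# 1) Check raw TCP reachability first. If this times out, fix security groups,
#    public accessibility, or VPC routing before touching any Python driver.
try:
    socket.create_connection((host, port), timeout=5).close()
    print("TCP connection OK")
except OSError as exc:
    print(f"Cannot reach {host}:{port} -> {exc}")

# 2) Build the URL with URL.create so '@' and other special characters in the
#    username or password are escaped correctly.
url = URL.create(
    drivername="postgresql+psycopg2",
    username="my_user",        # placeholder
    password="my_password",    # placeholder
    host=host,
    port=port,
    database="dev",
)
rs_engine = create_engine(url)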

AWS Lambda issue connecting to DocumentDB

I have written an AWS Lambda function which pulls data from an API and inserts the data into a DocumentDB database. When I connect to my cluster from a shell and run my Python script, it works just fine and inserts the data with no problem.
But when I implement the same logic in a Lambda function, it does not work. Below is an example of what works in the shell but not through a Lambda function.
import urllib3
import json
import certifi
import pymongo
from pymongo import MongoClient
# Make our connection to the DocumentDB cluster
# (Here I use the DocumentDB URI)
client = MongoClient('mongodb://admin_name_here:<insertYourPassword>@my_docdb_cluster/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&retryWrites=false')
# Specify the database to use
db = client.my_db
# Specify the collection to use
col = db.my_col
col.insert_one({"name": "abcdefg"})
The above works just fine in the shell but when run in Lambda I get the following error:
[ERROR] ServerSelectionTimeoutError: my_docdb_cluster timed out, Timeout: 30s, Topology Description: <TopologyDescription id: ***********, topology_type: ReplicaSetNoPrimary, servers: [<ServerDescription (my_docdb_cluster) server_type: Unknown, rtt: None, error=NetworkTimeout(my_docdb_cluster timed out')>]>
From my understanding, this error is telling me that the replica set has no primary. But that is not true; there definitely is a primary in my replica set. Does anyone know what the problem could be here?
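In Lambda this symptom is usually networking rather than the replica set itself: DocumentDB is only reachable from inside its VPC, so unless the function is attached to the same VPC and subnets with a security group that allows port 27017, server discovery times out and pymongo reports ReplicaSetNoPrimary. A small sketch that fails fast and keeps the TLS options explicit is below; the endpoint and credentials are placeholders, the CA bundle is assumed to be packaged with the function, and the option names follow pymongo 3.x as in the question.

import pymongo

# Placeholder endpoint/credentials; rds-combined-ca-bundle.pem is assumed to be
# deployed alongside the function code.
client = pymongo.MongoClient(
    "mongodb://admin_name_here:<insertYourPassword>@my_docdb_cluster:27017/",
    ssl=True,
    ssl_ca_certs="rds-combined-ca-bundle.pem",
    retryWrites=False,
    serverSelectionTimeoutMS=5000,  # surface network problems in 5s, not 30s
)


def lambda_handler(event, context):
    # Force server selection early so a VPC/security-group problem fails loudly.
    client.admin.command("ping")
    client.my_db.my_col.insert_one({"name": "abcdefg"})
    return {"status": "ok"}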

Connecting to Azure PostgreSQL server from python psycopg2 client

I have trouble connecting to an Azure Postgres database from Python. I am following the guide here - https://learn.microsoft.com/cs-cz/azure/postgresql/connect-python
I have basically the same code for setting up the connection.
But psycopg2 and SQLAlchemy both throw the same error:
OperationalError: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
I am able to connect to the instance with other client tools such as DBeaver, but from Python it does not work.
When I investigate the Postgres logs, I can see that the server actually authorized the connection, but the next line says:
could not receive data from client: An existing connection was forcibly closed by the remote host.
Python is 3.7
psycopg2's version is 2.8.5
The Azure Postgres region is West Europe
Does anyone have a suggestion on what I should try to make it work?
Thank you!
EDIT:
The issue resolved itself. I tried the same setup a few days later and it started working. Might have been something wrong with the Azure West Europe.
I had this issue too. I think I read somewhere (I forget where) that Azure has an issue with the @ you have to use for the username (user@serverName).
I created variables and an f-string and then it worked OK.
import sqlalchemy

username = 'user@server_name'
password = 'PassWord!'
host = 'server_name.postgres.database.azure.com'
database = 'your_database'
conn_str = f'postgresql+psycopg2://{username}:{password}@{host}/{database}'
After that:
engine = sqlalchemy.create_engine(conn_str, pool_pre_ping=True)
conn = engine.connect()
Test it with a simple SQL statement.
sql = 'SELECT * FROM public.some_table;'
results = conn.engine.execute(sql)
This was a connection in UK South. Before that, it did complain about the format of the username having to use @, although the username was correct, as tested from the command line with psql and another SQL client.
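Because the Azure single-server username itself contains an @, another option is to percent-encode the credentials before building the URL, which removes any ambiguity in how the string is parsed. This is a sketch with placeholder names, assuming SQLAlchemy 1.4+ and psycopg2:

from urllib.parse import quote_plus

import sqlalchemy

username = quote_plus('user@server_name')  # '@' becomes '%40'
password = quote_plus('PassWord!')         # also escapes other special characters
host = 'server_name.postgres.database.azure.com'
database = 'your_database'

conn_str = f'postgresql+psycopg2://{username}:{password}@{host}/{database}'
engine = sqlalchemy.create_engine(conn_str, pool_pre_ping=True)

with engine.connect() as conn:
    # text() is the portable way to run a raw SQL string on recent SQLAlchemy
    result = conn.execute(sqlalchemy.text('SELECT 1'))
    print(result.scalar())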

Connecting to MongoDB in Kubernetes pod with kubernetes-client using Python

I have a MongoDB instance running on Kubernetes and I'm trying to connect to it using Python with the Kubernetes library.
I'm connecting to the context on cmd line using:
kubectl config use-context CONTEXTNAME
With Python, I'm using:
from kubernetes import client, config
config.load_kube_config(
context = 'CONTEXTNAME'
)
To connect to MongoDB in cmd line:
kubectl port-forward svc/mongo-mongodb 27083:27017 -n production &
I then open a new terminal and use PORT_FORWARD_PID=$! to connect
I'm trying to connect to the MongoDB instance from Python using the kubernetes-client library. Any ideas on how to accomplish the above?
Define a Kubernetes Service for the MongoDB deployment, then reference your MongoDB from inside the cluster using a connection string similar to mongodb://<service-name>.default.svc.cluster.local
My understanding is that you need to find out your DB Client Endpoint.
That can be achieved by following this article: MongoDB on K8s.
Make sure you have the URI for MongoDB, for example:
"mongodb://mongo-0.mongo,mongo-1.mongo,mongo-2.mongo:27017/dbname_?"
After that, you can call your DB client in a Python script:
import pymongo
import sys

# Create a MongoDB client
client = pymongo.MongoClient('mongodb://......')

# Specify the database to be used
db = client.test

# Specify the collection to be used
col = db.myTestCollection

# Insert a single document
col.insert_one({'hello': 'world'})

# Find the document that was previously written
x = col.find_one({'hello': 'world'})

# Print the result to the screen
print(x)

# Close the connection
client.close()
Hope that will give you an idea.
Good luck!
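To reproduce the question's exact workflow (port-forward, then connect) from a single Python script, one pragmatic sketch is to launch kubectl port-forward as a subprocess and point pymongo at the forwarded local port. It assumes kubectl is on the PATH with the context already set, and reuses the service, namespace, and local port from the question:

import subprocess
import time

import pymongo

# Start the port-forward exactly as on the command line.
port_forward = subprocess.Popen(
    ["kubectl", "port-forward", "svc/mongo-mongodb", "27083:27017", "-n", "production"]
)
time.sleep(3)  # crude wait for the tunnel to come up

try:
    client = pymongo.MongoClient("mongodb://localhost:27083/",
                                 serverSelectionTimeoutMS=5000)
    print(client.admin.command("ping"))
finally:
    port_forward.terminate()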

get list of tables in database using boto3

I'm trying to get a list of the tables from a database in my AWS Data Catalog using boto3. I'm running the code below on AWS, in a SageMaker notebook. It runs forever (over 30 minutes) and doesn't return any results, even though test_db only has 4 tables in it. My goal is to run similar code as part of an AWS Glue ETL job script. Does anyone see what the issue might be, or can you suggest how to do this?
code:
import boto3
from pprint import pprint

# List tables registered in the Glue Data Catalog database
glue = boto3.client('glue', region_name='us-east-2')
response = glue.get_tables(
    DatabaseName='test_db'
)
pprint(response['TableList'])

# Separate attempt: list DynamoDB tables (not the Glue catalog)
db = boto3.resource('dynamodb', region_name='us-east-2')
tables = list(db.tables.all())
print(tables)
Resource: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/dynamodb.html
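For reference, a minimal sketch that lists only the table names through a Glue paginator is below; the database name and region are taken from the question, and credentials are assumed to come from the notebook's execution role. If even this small call hangs, the usual culprit in SageMaker is a notebook running inside a VPC without a NAT gateway or a Glue VPC endpoint, so the API request never reaches the Glue service.

import boto3

glue = boto3.client('glue', region_name='us-east-2')

# get_tables is paginated; a paginator walks every page for the database.
paginator = glue.get_paginator('get_tables')
table_names = [
    table['Name']
    for page in paginator.paginate(DatabaseName='test_db')
    for table in page['TableList']
]
print(table_names)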
