I have a Docker container running as an AWS ECS task. The container needs a dataframe that contains customer information in each row. Each row is processed in turn, and data stored in AWS S3 is read depending on the customer information in that row. Multiple metrics and PDF reports are generated for each row. In the initial tests I performed, the customer information was provided as a CSV inside the container. This CSV was read and the container ran successfully.
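For context, the per-row processing looks roughly like this (a minimal sketch; the bucket name, key layout, and customer_id column are placeholders rather than the real layout):

import boto3
import pandas as pd

def process_customers(df: pd.DataFrame) -> None:
    """Read each customer's data from S3 and build metrics/PDF reports per row."""
    s3 = boto3.client("s3")
    for _, row in df.iterrows():
        # The key layout below is hypothetical; the real mapping from customer
        # info to S3 objects lives in the container code.
        obj = s3.get_object(Bucket="customer-data-bucket", Key=f"data/{row['customer_id']}.csv")
        customer_data = pd.read_csv(obj["Body"])
        # ... compute metrics and render the PDF report for this row ...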
The next step in the testing was to read the customer information from an RDS instance (MySQL) hosted in AWS. So basically, I query the RDS instance and get a dataframe (analogous to the CSV with customer information described above). This is where I am having problems. I am getting the following error after running the ECS task on Fargate (port number not shown):
HTTPConnectionPool(host='localhost', port=[port_number]): Max retries exceeded with url: /session/03b59c11-8e28-4e6a-9b66-77959c774858/window/maximize (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f515cf97580>: Failed to establish a new connection: [Errno 111] Connection refused'))
When I look at the logs in CloudWatch, I can see that the query executes successfully and the container starts reading each row of the dataframe. However, it breaks after N rows have been read (or perhaps there is a time limit?).
These are the functions in the container that I use to execute a given query:
import os

import pandas as pd
from mysql.connector import connect


def create_connection_mysql():
    """
    Opens a connection to a MySQL server and returns a MySQLConnection object.

    Returns
    -------
    connection : MySQLConnection
        MySQLConnection object.
    """
    # read credentials from the environment
    host = os.getenv("HOST")
    user = os.getenv("USER_MYSQL")
    passwd = os.getenv("PASSWORD")
    database = os.getenv("DATABASE")
    # create connection
    connection = connect(
        host=host,
        user=user,
        passwd=passwd,
        database=database,
        use_unicode=True,
        charset="utf8",
        port=3306
    )
    print("Connection to MySQL DB successful")
    return connection


def execute_query(connection, query):
    """
    Executes a query on MySQL and returns the result as a dataframe.

    Parameters
    ----------
    connection : MySQLConnection object
        A MySQLConnection object to access the database.
    query : str
        A valid query to perform on the provided connection.

    Returns
    -------
    result : DataFrame
        Returns a DataFrame with the results of the query.
    """
    # execute query
    cursor = connection.cursor(buffered=True)
    cursor.execute(query)
    # output as a pandas DataFrame
    colnames = cursor.column_names
    result = pd.DataFrame(cursor.fetchall(), columns=colnames)
    # close the cursor and the connection, then return the result
    cursor.close()
    connection.close()
    return result
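For reference, the two helpers are used together along these lines (the query text is illustrative):

connection = create_connection_mysql()
customers_df = execute_query(connection, "SELECT * FROM customers;")

Note that execute_query closes the connection after fetching, so a fresh connection is needed for each query.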
Because these functions are working correctly in the container, I wonder what could be causing this problem with the RDS instance.
Related
I'm getting an error when attempting to insert from a temp table into a table that exists in Synapse. Here is the relevant code:
def load_adls_data(self, schema: str, table: str, environment: str, filepath: str, columns: list) -> str:
    if self.exists_schema(schema):
        if self.exists_table(schema, table):
            if environment.lower() == 'prod':
                schema = "lvl0"
            else:
                schema = f"{environment.lower()}_lvl0"
            temp_table = self.generate_temp_create_table(schema, table, columns)
            sql0 = """
            IF OBJECT_ID('tempdb..#CopyDataFromADLS') IS NOT NULL
            BEGIN
                DROP TABLE #CopyDataFromADLS;
            END
            """
            sql1 = """
            {}
            COPY INTO #CopyDataFromADLS FROM
            '{}'
            WITH
            (
                FILE_TYPE = 'CSV',
                FIRSTROW = 1
            )
            INSERT INTO {}.{}
            SELECT *, GETDATE(), '{}' from #CopyDataFromADLS
            """.format(temp_table, filepath, schema, table, Path(filepath).name)
            print(sql1)
            conn = pyodbc.connect(self._synapse_cnx_str)
            conn.autocommit = True
            with conn.cursor() as db:
                db.execute(sql0)
                db.execute(sql1)
If I get rid of the insert statement and just do a select from the temp table in the script:
SELECT * FROM #CopyDataFromADLS
I get the same error in either case:
pyodbc.ProgrammingError: ('42000', '[42000] [Microsoft][ODBC Driver 17 for SQL Server][SQL Server]Not able to validate external location because The remote server returned an error: (409) Conflict. (105215) (SQLExecDirectW)')
I've run the generated code for both the insert and the select in Synapse and they ran perfectly. Google has no real information on this, so could someone assist with this? Thanks
This error occurs mostly because of authentication or access.
Make sure you have the Storage Blob Data Contributor role on the storage account.
In the COPY INTO script, add the authentication key for the blob storage, unless it is a public blob storage (see the sketch below).
I tried to repro this using a COPY INTO statement without authentication and got the same error.
After adding authentication using a SAS key, the data was copied successfully.
Refer to the Microsoft documentation for the permissions required for bulk load using COPY INTO statements.
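For example, the sql1 string from the question could carry a SAS credential roughly like this (a sketch; the SAS token is a placeholder and must be valid for the storage account referenced by filepath):

sql1 = """
{}
COPY INTO #CopyDataFromADLS FROM
'{}'
WITH
(
    FILE_TYPE = 'CSV',
    FIRSTROW = 1,
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<your-SAS-token>')
)
INSERT INTO {}.{}
SELECT *, GETDATE(), '{}' from #CopyDataFromADLS
""".format(temp_table, filepath, schema, table, Path(filepath).name)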
Attempting to read a view that was created in AWS Athena (based on a Glue table that points to a parquet file in S3) using PySpark on a Databricks cluster throws the following error for an unknown reason:
java.lang.IllegalArgumentException: Can not create a Path from an empty string;
The first assumption was that access permissions are missing, but that wasn't the case.
While researching further, I found the following Databricks post about the cause of this issue: https://docs.databricks.com/data/metastores/aws-glue-metastore.html#accessing-tables-and-views-created-in-other-system
I was able to come up with a Python script to fix the problem. It turns out that this exception occurs because Athena and Presto store view metadata in a format that is different from what Databricks Runtime and Spark expect, so you need to re-create your views through Spark.
Example Python script, with an example invocation at the end:
import boto3
import time


def execute_blocking_athena_query(query: str, athenaOutputPath, aws_region):
    # Start the query in Athena and poll until it finishes
    athena = boto3.client("athena", region_name=aws_region)
    res = athena.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": athenaOutputPath},
    )
    execution_id = res["QueryExecutionId"]
    while True:
        res = athena.get_query_execution(QueryExecutionId=execution_id)
        state = res["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return
        if state in ["FAILED", "CANCELLED"]:
            raise Exception(res["QueryExecution"]["Status"]["StateChangeReason"])
        time.sleep(1)


def create_cross_platform_view(db: str, table: str, query: str, spark_session, athenaOutputPath, aws_region):
    glue = boto3.client("glue", region_name=aws_region)
    # Drop any existing view, then create it through Athena to capture the Presto metadata
    glue.delete_table(DatabaseName=db, Name=table)
    create_view_sql = f"create view {db}.{table} as {query}"
    execute_blocking_athena_query(create_view_sql, athenaOutputPath, aws_region)
    presto_schema = glue.get_table(DatabaseName=db, Name=table)["Table"]["ViewOriginalText"]
    glue.delete_table(DatabaseName=db, Name=table)

    # Re-create the view through Spark, then patch the Glue entry so it stays
    # readable by Athena/Presto as well
    spark_session.sql(create_view_sql).show()
    spark_view = glue.get_table(DatabaseName=db, Name=table)["Table"]
    for key in [
        "DatabaseName",
        "CreateTime",
        "UpdateTime",
        "CreatedBy",
        "IsRegisteredWithLakeFormation",
        "CatalogId",
    ]:
        # Remove read-only fields that update_table does not accept
        if key in spark_view:
            del spark_view[key]
    spark_view["ViewOriginalText"] = presto_schema
    spark_view["Parameters"]["presto_view"] = "true"
    spark_view = glue.update_table(DatabaseName=db, TableInput=spark_view)


create_cross_platform_view("<YOUR DB NAME>", "<YOUR VIEW NAME>", "<YOUR VIEW SQL QUERY>", <SPARK_SESSION_OBJECT>, "<S3 BUCKET FOR OUTPUT>", "<YOUR-ATHENA-SERVICE-AWS-REGION>")
Again, note that this script keeps your views compatible with Glue/Athena.
References:
https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore/issues/29
https://docs.databricks.com/data/metastores/aws-glue-metastore.html#accessing-tables-and-views-created-in-other-system
I'm trying to write chunks of 100000 rows to an AWS RDS PostgreSQL server.
I'm using psycopg2 2.8 and multiprocessing. I create a new connection in each process and prepare the SQL statement as well, but every time a random number of rows gets inserted. I assume the issue is the Python multiprocessing library closing the wrong connections, which is mentioned here: multiprocessing module and distinct psycopg2 connections
and here: https://github.com/psycopg/psycopg2/issues/829 in one of the comments.
The RDS server logs say:
LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on client connection with an open transaction
Here is the skeleton of the code:
from multiprocessing import Pool
import csv

import psycopg2
from psycopg2 import sql
from psycopg2.extensions import connection


def gen_chunks(reader, chunksize=10 ** 5):
    """
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices.
    """
    chunk = []
    for index, line in enumerate(reader):
        if index % chunksize == 0 and index > 0:
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk


def write_process(chunk, postgres_conn_uri):
    conn = psycopg2.connect(dsn=postgres_conn_uri)
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                '''PREPARE scrape_info_query_plan (int, bool, bool) AS
                INSERT INTO schema_name.table_name (a, b, c)
                VALUES ($1, $2, $3)
                ON CONFLICT (a, b) DO UPDATE SET (c) = (EXCLUDED.c)
                '''
            )
            for row in chunk:
                cur.execute(
                    sql.SQL(
                        ''' EXECUTE scrape_info_query_plan ({})'''
                    ).format(sql.SQL(', ').join([sql.Literal(value) for value in [1, True, True]]))
                )


pool = Pool()
reader = csv.DictReader('csv file path', skipinitialspace=True)
for chunk in gen_chunks(reader):
    # chunk is an array of rows (100000) from the csv
    pool.apply_async(write_process, [chunk, postgres_conn_uri])
Commands to create the required DB objects:
1. CREATE DATABASE foo;
2. CREATE SCHEMA schema_name;
3. CREATE TABLE table_name (
       x serial PRIMARY KEY,
       a integer,
       b boolean,
       c boolean
   );
Any suggestions on this?
Note: I'm using an EC2 instance with 64 vCPUs, and I can see 60 to 64 parallel connections on my RDS instance.
I use the following approach to connect my AWS Lambda function to the DB:
https://www.isc.upenn.edu/accessing-mysql-databases-aws-python-lambda-function
Here is the code that does the job:
def lambda_handler(event, context):
    """
    This function inserts content into a MySQL RDS instance.
    """
    item_count = 0
    with conn.cursor() as cur:
        cur.execute("create table Employee3 (EmpID int NOT NULL, Name varchar(255) NOT NULL, PRIMARY KEY (EmpID))")
        cur.execute('insert into Employee3 (EmpID, Name) values(1, "Joe")')
        cur.execute('insert into Employee3 (EmpID, Name) values(2, "Bob")')
        cur.execute('insert into Employee3 (EmpID, Name) values(3, "Mary")')
        conn.commit()
        cur.execute("select * from Employee3")
        for row in cur:
            item_count += 1
            logger.info(row)
    return "Added %d items to RDS MySQL table" % (item_count)
The problem is that when I invoke the Lambda, I get a result back fine, but if I change the data in the DB and then send the request again while the Lambda container is still warm, I do not see the updated data; I see the old data instead.
However, when I save the Lambda after a change, which kills the current container, it starts loading the latest information.
How can I fix this?
First off, double check that your response isn't being cached anywhere else (API Gateway, CloudFront, or whatever else you are using to call this Lambda).
All variables outside of the handler in AWS Lambda are treated as global (to the individual instance of that Lambda container, at least).
The issue lies with something you have instantiated outside of the handler, namely the connection (your conn variable). Move that code inside the handler function.
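A minimal sketch of that change, again assuming pymysql and placeholder environment variable names:

import os

import pymysql

def lambda_handler(event, context):
    # Open the connection per invocation so every request sees current data
    conn = pymysql.connect(
        host=os.environ["RDS_HOST"],
        user=os.environ["RDS_USER"],
        passwd=os.environ["RDS_PASSWORD"],
        db=os.environ["RDS_DB"],
        connect_timeout=5,
    )
    try:
        with conn.cursor() as cur:
            cur.execute("select * from Employee3")
            rows = cur.fetchall()
        return "Fetched %d items from RDS MySQL table" % len(rows)
    finally:
        conn.close()

Opening a connection per invocation adds a little latency, but it avoids the stale reads described in the question.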
I'm trying to retrieve the SQL that makes up a stored query inside an Access database.
I'm using a combination of UCanAccess 4.0.2, jaydebeapi, and the UCanAccess console. The ultimate goal is to be able to do the following from a Python script with no user intervention.
When UCanAccess loads, it successfully loads the query:
Please, enter the full path to the access file (.mdb or .accdb): /Users/.../SnohomishRiverEstuaryHydrology_RAW.accdb
Loaded Tables:
Sensor Data, Sensor Details, Site Details
Loaded Queries:
Jeff_Test
Loaded Procedures:
Loaded Indexes:
Primary Key on Sensor Data Columns: (ID)
, Primary Key on Sensor Details Columns: (ID)
, Primary Key on Site Details Columns: (ID)
, Index on Sensor Details Columns: (SiteID)
, Index on Site Details Columns: (SiteID)
UCanAccess>
When I run, from the UCanAccess console, a query like
SELECT * FROM JEFF_TEST;
I get the expected results of the query.
I have tried various things, including this monstrous query from inside a Python script, even using the sysSchema=True option (from here: http://www.sqlquery.com/Microsoft_Access_useful_queries.html):
SELECT DISTINCT MSysObjects.Name,
IIf([Flags]=0,"Select",IIf([Flags]=16,"Crosstab",IIf([Flags]=32,"Delete",IIf
([Flags]=48,"Update",IIf([flags]=64,"Append",IIf([flags]=128,"Union",
[Flags])))))) AS Type
FROM MSysObjects INNER JOIN MSysQueries ON MSysObjects.Id =
MSysQueries.ObjectId;
But I get an object not found or insufficient privileges error.
At this point, I've tried mdbtools and can successfully retrieve metadata and data from Access. I just need to get the queries out too.
If anyone can point me in the right direction, I'd appreciate it. Windows is not a viable option.
Cheers, Seth
***********************************
* SOLUTION
***********************************
from jpype import *

startJVM(getDefaultJVMPath(), "-ea", "-Djava.class.path=/Users/seth.urion/local/access/UCanAccess-4.0.2-bin/ucanaccess-4.0.2.jar:/Users/seth.urion/local/access/UCanAccess-4.0.2-bin/lib/commons-lang-2.6.jar:/Users/seth.urion/local/access/UCanAccess-4.0.2-bin/lib/commons-logging-1.1.1.jar:/Users/seth.urion/local/access/UCanAccess-4.0.2-bin/lib/hsqldb.jar:/Users/seth.urion/local/access/UCanAccess-4.0.2-bin/lib/jackcess-2.1.6.jar")

conn = java.sql.DriverManager.getConnection("jdbc:ucanaccess:///Users/seth.urion/PycharmProjects/pyAccess/FE_Hall_2010_2016_SnohomishRiverEstuaryHydrology_RAW.accdb")

for query in conn.getDbIO().getQueries():
    print(query.getName())
    print(query.toSQLString())
If you can find a satisfactory way to call Java methods from within Python then you could use the Jackcess Query#toSQLString() method to extract the SQL for a saved query. For example, I just got this to work under Jython:
from java.sql import DriverManager

def get_query_sql(conn, query_name):
    sql = ''
    for query in conn.getDbIO().getQueries():
        if query.getName() == query_name:
            sql = query.toSQLString()
            break
    return sql

# usage example
if __name__ == '__main__':
    conn = DriverManager.getConnection("jdbc:ucanaccess:///home/gord/UCanAccessTest.accdb")
    query_name = 'Jeff_Test'
    query_sql = get_query_sql(conn, query_name)
    if query_sql == '':
        print '(Query not found.)'
    else:
        print 'SQL for query [%s]:' % (query_name)
        print
        print query_sql
    conn.close()
producing
SQL for query [Jeff_Test]:
SELECT Invoice.InvoiceNumber, Invoice.InvoiceDate
FROM Invoice
WHERE (((Invoice.InvoiceNumber)>1));