Python3: Multiprocessing closing psycopg2 connections to Postgres at AWS RDS - python-3.x

I'm trying to write chunks of 100,000 rows to an AWS RDS PostgreSQL server.
I'm using psycopg2 2.8 and multiprocessing. I create a new connection and prepare the SQL statement in each process, but every time a random number of rows gets inserted. I assume the issue is the Python multiprocessing library closing the wrong connections, which is mentioned here: multiprocessing module and distinct psycopg2 connections
and in one of the comments here: https://github.com/psycopg/psycopg2/issues/829
The RDS server log says:
LOG: could not receive data from client: Connection reset by peer
LOG: unexpected EOF on client connection with an open transaction
Here is the skeleton of the code:
from multiprocessing import Pool
import csv
from psycopg2 import sql
import psycopg2
from psycopg2.extensions import connection
def gen_chunks(reader, chunksize=10 ** 5):
    """
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices.
    """
    chunk = []
    for index, line in enumerate(reader):
        if index % chunksize == 0 and index > 0:
            yield chunk
            del chunk[:]
        chunk.append(line)
    yield chunk
def write_process(chunk, postgres_conn_uri):
    conn = psycopg2.connect(dsn=postgres_conn_uri)
    with conn:
        with conn.cursor() as cur:
            cur.execute(
                '''PREPARE scrape_info_query_plan (int, bool, bool) AS
                   INSERT INTO schema_name.table_name (a, b, c)
                   VALUES ($1, $2, $3)
                   ON CONFLICT (a, b) DO UPDATE SET (c) = (EXCLUDED.c)
                '''
            )
            for row in chunk:
                cur.execute(
                    sql.SQL('''EXECUTE scrape_info_query_plan ({})''').format(
                        sql.SQL(', ').join([sql.Literal(value) for value in [1, True, True]])
                    )
                )
pool = Pool()
reader = csv.DictReader(open('csv file path'), skipinitialspace=True)
for chunk in gen_chunks(reader):
    # chunk is a list of 100000 rows from the csv
    pool.apply_async(write_process, [chunk, postgres_conn_uri])
Commands to create the required DB objects:
1. CREATE DATABASE foo;
2. CREATE SCHEMA schema_name;
3. CREATE TABLE table_name (
       x serial PRIMARY KEY,
       a integer,
       b boolean,
       c boolean
   );
Any suggestions on this?
Note: I'm running this on an EC2 instance with 64 vCPUs, and I can see 60 to 64 parallel connections on my RDS instance.
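One detail worth noting about the skeleton: apply_async returns immediately, and the script never waits on the pool, so if the parent process reaches the end of the file while workers are still inserting, their connections are torn down mid-transaction, which would be consistent with the "unexpected EOF on client connection with an open transaction" log line. Below is a minimal sketch of the driver loop that waits for every chunk and hands each worker its own copy of the chunk, assuming the same gen_chunks and write_process as above; 'csv file path' and postgres_conn_uri are placeholders, as in the question.
if __name__ == '__main__':
    pool = Pool()
    results = []
    with open('csv file path', newline='') as f:
        reader = csv.DictReader(f, skipinitialspace=True)
        for chunk in gen_chunks(reader):
            # pass a copy: `del chunk[:]` in gen_chunks can empty the list
            # before apply_async has pickled and sent it to a worker
            results.append(
                pool.apply_async(write_process, [list(chunk), postgres_conn_uri])
            )
    pool.close()
    pool.join()        # block until every worker has finished and committed
    for r in results:
        r.get()        # re-raise any exception that happened in a worker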

Related

Access to RDS in AWS - HTTPConnectionPool: Max retries exceeded with url

I have a Docker container running on an AWS ECS task. Basically, this container needs to get a dataframe that contains customer information in each row. Each row is then accessed, and data stored in AWS S3 is read depending on the customer information in that row. Multiple metrics and PDF reports are generated for each row. In the initial tests that I performed, the customer information was provided in a CSV within the container. This CSV was read and the container ran successfully.
The next step in the testing was to read the customer information from an RDS instance (MySQL) hosted in AWS. So basically, I query the RDS instance and get a dataframe (analogous to the CSV described above with customer information). Here is where I am having problems. I am getting the following error after running the ECS task in Fargate (port number not shown):
HTTPConnectionPool(host='localhost', port=[port_number]): Max retries exceeded with url: /session/03b59c11-8e28-4e6a-9b66-77959c774858/window/maximize (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f515cf97580>: Failed to establish a new connection: [Errno 111] Connection refused'))
When I look at the logs in CloudWatch, I can see that the query is executed successfully and the container starts reading each row of the dataframe. However, it breaks after N rows are read (or perhaps there is a time limit?).
These are the functions in the container that I use to execute a given query:
# imports assumed from context: the buffered cursor and `column_names`
# attribute below suggest mysql-connector-python
import os

import pandas as pd
from mysql.connector import connect


def create_connection_mysql():
    """
    Opens a connection to a MySQL server and returns a MySQLConnection object.

    Returns
    -------
    connection : MySQLConnection
        MySQLConnection object.
    """
    # read credentials from environment file
    host = os.getenv("HOST")
    user = os.getenv("USER_MYSQL")
    passwd = os.getenv("PASSWORD")
    database = os.getenv("DATABASE")
    # create connection
    connection = connect(
        host=host,
        user=user,
        passwd=passwd,
        database=database,
        use_unicode=True,
        charset="utf8",
        port=3306
    )
    print("Connection to MySQL DB successful")
    return connection
def execute_query(connection, query):
    """
    Executes a query on MySQL and returns the result as a dataframe.

    Parameters
    ----------
    connection : MySQLConnection object
        A MySQLConnection object to access the database.
    query : str
        A valid query to perform on the provided connection.

    Returns
    -------
    result : DataFrame
        Returns a DataFrame with the results of the query.
    """
    # execute query
    cursor = connection.cursor(buffered=True)
    cursor.execute(query)
    # output as a pandas DataFrame
    colnames = cursor.column_names
    result = pd.DataFrame(cursor.fetchall(), columns=colnames)
    # close connection and return result
    cursor.close()
    connection.close()
    return result
Because these functions are working correctly in the container, I wonder what could be causing this problem with the RDS instance.
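For reference, this is roughly how the two helpers above are meant to be used (a minimal sketch; the query and table name are made up, and the HOST/USER_MYSQL/PASSWORD/DATABASE environment variables are assumed to be set):
# hypothetical usage of create_connection_mysql / execute_query
connection = create_connection_mysql()
customer_df = execute_query(connection, "SELECT * FROM customers")  # "customers" is a placeholder
print(customer_df.shape)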

Pandas .to_sql fails silently randomly

I have several large pandas dataframes (about 30k+ rows) and need to upload a different version of them daily to a MS SQL Server db. I am trying to do so with the to_sql pandas function. On occasion, it will work. Other times, it will fail - silently - as if the code uploaded all of the data despite not having uploaded a single row.
Here is my code:
class SQLServerHandler(DataBaseHandler):
    ...
    def _getSQLAlchemyEngine(self):
        '''
        Get an sqlalchemy engine
        from the connection string

        The fast_executemany fails silently:
        https://stackoverflow.com/questions/48307008/pandas-to-sql-doesnt-insert-any-data-in-my-table/55406717
        '''
        # escape special characters as required by sqlalchemy
        dbParams = urllib.parse.quote_plus(self.connectionString)
        # create engine
        engine = sqlalchemy.create_engine(
            'mssql+pyodbc:///?odbc_connect={}'.format(dbParams))
        return engine

    @logExecutionTime('Time taken to upload dataframe:')
    def uploadData(self, tableName, dataBaseSchema, dataFrame):
        '''
        Upload a pandas dataFrame
        to a database table <tableName>
        '''
        engine = self._getSQLAlchemyEngine()
        dataFrame.to_sql(
            tableName,
            con=engine,
            index=False,
            if_exists='append',
            method='multi',
            chunksize=50,
            schema=dataBaseSchema)
Switching method to None works, but the data takes an unreasonably long time to upload (30+ minutes). With around 20 tables of this size per day, that rules this solution out.
The proposed solution here to add the schema as a parameter doesn't work. Neither does creating a sqlalchemy session and passing it to the con parameter with session.get_bind().
I am using:
ODBC Driver 17 for SQL Server
pandas 1.2.1
sqlalchemy 1.3.22
pyodbc 4.0.30
Does anyone know how to make it raise an exception if it fails?
Or why it is not uploading any data?
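One way to at least turn the silent failure into a loud one (a sketch, not part of the original code; it assumes the engine from _getSQLAlchemyEngine() above and a table with no concurrent writers) is to compare row counts before and after the upload and raise when they don't add up:
import sqlalchemy as sa

def upload_and_verify(dataFrame, engine, tableName, dataBaseSchema=None):
    qualified = f"[{dataBaseSchema}].[{tableName}]" if dataBaseSchema else f"[{tableName}]"
    count_sql = sa.text(f"SELECT COUNT(*) FROM {qualified}")
    with engine.connect() as conn:
        before = conn.execute(count_sql).scalar()
    # same to_sql call as in uploadData() above
    dataFrame.to_sql(tableName, con=engine, index=False, if_exists='append',
                     method='multi', chunksize=50, schema=dataBaseSchema)
    with engine.connect() as conn:
        after = conn.execute(count_sql).scalar()
    if after - before != len(dataFrame):
        raise RuntimeError(
            f"to_sql reported success but only {after - before} of {len(dataFrame)} rows arrived")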
In rebuttal to this answer, if to_sql() were to fall victim to the issue described in
SQL Server does not finish execution of a large batch of SQL statements
then it would have to be constructing large anonymous code blocks of the form
-- Note no SET NOCOUNT ON;
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (0, 'row0');
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (1, 'row1');
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (2, 'row2');
…
and that is not what to_sql() is doing. If it were, then it would start to fail well below 1_000 rows, at least on SQL Server 2017 Express Edition:
import pandas as pd
import pyodbc
import sqlalchemy as sa
print(pyodbc.version) # 4.0.30
table_name = "gh_pyodbc_262"
num_rows = 400
print(f" num_rows: {num_rows}") # 400
cnxn = pyodbc.connect("DSN=mssqlLocal64", autocommit=True)
crsr = cnxn.cursor()
crsr.execute(f"TRUNCATE TABLE {table_name}")
sql = "".join(
[
f"INSERT INTO {table_name} ([id], [txt]) VALUES ({i}, 'row{i}');"
for i in range(num_rows)
]
)
crsr.execute(sql)
row_count = crsr.execute(f"SELECT COUNT(*) FROM {table_name}").fetchval()
print(f"row_count: {row_count}") # 316
Using to_sql() for that same operation works
import pandas as pd
import pyodbc
import sqlalchemy as sa
print(pyodbc.version) # 4.0.30
table_name = "gh_pyodbc_262"
num_rows = 400
print(f" num_rows: {num_rows}") # 400
df = pd.DataFrame(
    [(i, f"row{i}") for i in range(num_rows)], columns=["id", "txt"]
)
engine = sa.create_engine(
    "mssql+pyodbc://@mssqlLocal64", fast_executemany=True
)
df.to_sql(
    table_name,
    engine,
    index=False,
    if_exists="replace",
)
with engine.connect() as conn:
    row_count = conn.execute(
        sa.text(f"SELECT COUNT(*) FROM {table_name}")
    ).scalar()
print(f"row_count: {row_count}")  # 400
and indeed will work for thousands and even millions of rows. (I did a successful test with 5_000_000 rows.)
Ok, this seems to be an issue with SQL Server itself.
SQL Server does not finish execution of a large batch of SQL statements

aiomysql select data problem: not updated

version:
Python 3.6.9
aiomysql: 0.0.20
aiohttp: 3.6.2
problem:
When MySQL table data is deleted or inserted, the query result is not updated for hours, unless the web app is restarted.
Code using the aiomysql pool:
# initial
pool = await aiomysql.create_pool(
    # echo=True,
    db=conf['database'],
    user=conf['user'],
    password=conf['password'],
    host=conf['host'],
    port=conf['port'],
    minsize=conf['minsize'],
    maxsize=conf['maxsize'],
)

# query
async def get_data(request):
    cmd = 'select a,b,c from tbl where d = 0'
    # request.app['db'] == pool
    async with request.app['db'].acquire() as conn:
        async with conn.cursor() as cur:
            await cur.execute(cmd)
            ...
Current solution:
Setting pool_recycle=20 in aiomysql.create_pool seems to solve the problem, but why? Is there a better way?
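The behaviour is consistent with MySQL's default REPEATABLE READ isolation level: a pooled connection that never commits keeps reading from the snapshot opened by its first query, so it never sees later inserts or deletes, and recycling the connection merely forces a new snapshot. A sketch of an alternative (an assumption on my part, not from the original post): end the transaction after each read, for example by enabling autocommit on the pool, instead of recycling connections every 20 seconds.
pool = await aiomysql.create_pool(
    db=conf['database'],
    user=conf['user'],
    password=conf['password'],
    host=conf['host'],
    port=conf['port'],
    minsize=conf['minsize'],
    maxsize=conf['maxsize'],
    autocommit=True,          # each statement runs in its own transaction
)

async def get_data(request):
    cmd = 'select a,b,c from tbl where d = 0'
    async with request.app['db'].acquire() as conn:
        async with conn.cursor() as cur:
            await cur.execute(cmd)
            rows = await cur.fetchall()
        # without autocommit you could instead end the read transaction here:
        # await conn.commit()
    return rows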

Influxdb bulk insert using influxdb-python

I use influxdb-python to insert a large amount of data read from a Redis stream. The Redis stream is set to maxlen=600, data arrives every 100 ms, and I need to retain all of it, so I read it and transfer it to InfluxDB (I don't know whether there is a better database for this). But with batch inserts, only ⌈count/batch_size⌉ points survive, one at the end of each batch; the rest appear to be overwritten. The code follows:
import redis
from apscheduler.schedulers.blocking import BlockingScheduler
import time
import datetime
import os
import struct
from influxdb import InfluxDBClient
def parse(datas):
    ts, data = datas
    w_json = {
        "measurement": 'sensor1',
        "fields": {
            "Value": data[b'Value'].decode('utf-8'),
            "Count": data[b'Count'].decode('utf-8')
        }
    }
    return w_json

def archived_data(rs, client):
    results = rs.xreadgroup('group1', 'test', {'test1': ">"}, count=600)
    if len(results) != 0:
        print("len(results[0][1]) = ", len(results[0][1]))
        datas = list(map(parse, results[0][1]))
        client.write_points(datas, batch_size=300)
        print('insert success')
    else:
        print("No new data is generated")

if __name__ == "__main__":
    try:
        rs = redis.Redis(host="localhost", port=6379, db=0)
        rs.xgroup_destroy("test1", "group1")
        rs.xgroup_create('test1', 'group1', '0-0')
    except Exception as e:
        print("error = ", e)
    try:
        client = InfluxDBClient(host="localhost", port=8086, database='test')
    except Exception as e:
        print("error = ", e)
    try:
        sched = BlockingScheduler()
        sched.add_job(archived_data, 'interval', seconds=60, args=[rs, client])
        sched.start()
    except Exception as e:
        print(e)
The data in InfluxDB changes as follows:
> select count(*) from sensor1;
name: sensor1
time count_Count count_Value
---- ----------- -----------
0 6 6
> select count(*) from sensor1;
name: sensor1
time count_Count count_Value
---- ----------- -----------
0 8 8
> select Count from sensor1;
name: sensor1
time Count
---- -----
1594099736722564482 00000310
1594099737463373188 00000610
1594099795941527728 00000910
1594099796752396784 00001193
1594099854366369551 00001493
1594099855120826270 00001777
1594099913596094653 00002077
1594099914196135122 00002361
Why does the data appear to be overwritten, and how can I make it insert all of the data?
I would appreciate it if you could tell me how to solve this.
Can you provide more details on the structure of the data that you wish to store in InfluxDB?
However, I hope the information below helps you.
In InfluxDB, timestamp + tags are unique (i.e. two data points with the same tag values and the same timestamp cannot exist). Unlike SQL databases, InfluxDB doesn't throw a unique-constraint violation; it overwrites the existing data with the incoming data. Your data doesn't seem to have tags, so any incoming point whose timestamp is already present in InfluxDB will overwrite the existing point.
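Applied to the parse() function in the question, one way around the overwrites (a sketch, not from the original answer) is to give each point its own timestamp derived from the Redis stream entry id, so that no two points in a batch end up sharing the same write time:
def parse(datas):
    # ts is the Redis stream entry id, e.g. b'1594099736722-0'
    # (milliseconds since epoch plus a sequence number)
    ts, data = datas
    millis, seq = ts.decode('utf-8').split('-')
    # spread entries that share a millisecond across distinct nanoseconds
    timestamp_ns = int(millis) * 1_000_000 + int(seq)
    w_json = {
        "measurement": 'sensor1',
        "time": timestamp_ns,          # explicit, unique timestamp
        "fields": {
            "Value": data[b'Value'].decode('utf-8'),
            "Count": data[b'Count'].decode('utf-8')
        }
    }
    return w_json

# when writing, tell the client the timestamps are in nanoseconds:
# client.write_points(datas, batch_size=300, time_precision='n')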

How do I insert new data to database every minute from api in python

Hi, I am getting weather data from an API every 10 minutes using Python. I have managed to connect to the API, get the data, and run the code every 10 minutes using a thread. However, the data that is recorded every 10 minutes is the same; the new weather data is never fetched. I would like new rows to be inserted, since the station publishes new records every 10 minutes. Thanks in advance for your help.
Below is my code.
timestr = datetime.now()
for data in retrieve_data():
    dashboard_data = data['dashboard_data']
    Dew_point = dashboard_data['Temperature'] - (100 - dashboard_data['Humidity']) / 5
    weather_list = [time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(dashboard_data['time_utc'])),
                    dashboard_data['Temperature'],
                    dashboard_data['Humidity'],
                    Dew_point,
                    dashboard_data['Pressure'],
                    dashboard_data['Noise'],
                    dashboard_data['CO2']
                    ]

def insert_weather_data(*arg):
    sql = (
        """INSERT INTO weather_data(time,
                                    temperature,
                                    humidity,
                                    dew_point,
                                    pressure,
                                    noise,
                                    CO2) VALUES(%s,%s,%s,%s,%s,%s,%s);
        """
    )
    conn = None
    weather_id = None
    # read database configuration
    params = config()
    # connect to postgres database
    conn = psycopg2.connect(**params)
    # create a new cursor
    cur = conn.cursor()
    try:
        # execute the insert statement
        cur.execute(sql, arg)
        conn.commit()
    except (Exception, psycopg2.DatabaseError) as error:
        print(error)
    finally:
        if conn is not None:
            conn.close()

def repeat_func_call():
    insert_weather_data(weather_list)
    threading.Timer(120, repeat_func_call, weather_list).start()

repeat_func_call()
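The reason the same row keeps being inserted is that weather_list is built once at the top of the script and never refreshed; the timer only re-inserts that stale list (and the 120-second timer doesn't match the 10-minute interval mentioned in the text). Below is a minimal sketch of the repeating part that fetches fresh data on every run, assuming the same retrieve_data() and insert_weather_data() as above; the API response keys are taken from the question.
import threading
import time

def build_weather_list():
    # rebuild the row from a fresh API call on every invocation
    for data in retrieve_data():
        dashboard_data = data['dashboard_data']
        dew_point = dashboard_data['Temperature'] - (100 - dashboard_data['Humidity']) / 5
        return [time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(dashboard_data['time_utc'])),
                dashboard_data['Temperature'],
                dashboard_data['Humidity'],
                dew_point,
                dashboard_data['Pressure'],
                dashboard_data['Noise'],
                dashboard_data['CO2']]

def repeat_func_call():
    weather_list = build_weather_list()            # fresh data each time
    insert_weather_data(*weather_list)             # one value per %s placeholder
    threading.Timer(600, repeat_func_call).start() # 600 s = 10 minutes

repeat_func_call()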
