Python for loop for multiple postgres queries based on dates - python-3.x

I want to create a for loop that queries data from my database one week at a time for three months.
For example:
import tempfile
import pandas as pd
import sqlalchemy

DB_URL = 'db/url/here:1234'
engine = sqlalchemy.create_engine(DB_URL)
conn = engine.connect()
queries = {'d_week1_start': '2017-12-1',
           'd_week1_end': '2017-12-7',
           'd_week2_start': '2017-12-7',
           'd_week2_end': '2017-12-14',
           'd_week3_start': '2017-12-14',
           'd_week3_end': '2017-12-21',
           'd_week4_start': '2017-12-21',
           'd_week4_end': '2017-12-31'}
for qr in queries:
    one_week_df = pd.read_sql(qr, conn)
    one_week_df.to_pickle('one_week.pkl')
How would I go about doing something like this?
Update: I realized I wasn't passing any queries. I changed the items in the dictionary to queries like:
'''SELECT * FROM table WHERE time > '20180201T120000' AND time < '20180207T120000' ORDER BY time ASC'''
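For reference, a minimal sketch of one way the loop could be built once the dictionary holds real queries, using a parameterized statement and a generated list of week boundaries. The table and column names are placeholders taken from the update, and the date range covering roughly three months is an assumption:
import pandas as pd
import sqlalchemy

DB_URL = 'db/url/here:1234'  # placeholder connection string, as above
engine = sqlalchemy.create_engine(DB_URL)

# Week-long [start, end) windows covering roughly three months.
week_starts = pd.date_range('2017-12-01', '2018-03-01', freq='7D').to_pydatetime()

query = sqlalchemy.text(
    "SELECT * FROM table WHERE time >= :start AND time < :end ORDER BY time ASC")

with engine.connect() as conn:
    for i in range(len(week_starts) - 1):
        start, end = week_starts[i], week_starts[i + 1]
        one_week_df = pd.read_sql(query, conn, params={'start': start, 'end': end})
        # Write one file per week so earlier weeks are not overwritten.
        one_week_df.to_pickle(f'week_{i + 1}.pkl')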

Related

How to compare datetime using psycopg2 in Python 3?

I wish to execute a statement in my Python script, as below, to delete records older than 45 days from a PostgreSQL table:
Consider only the code below:
import psycopg2
from datetime import datetime
cur = conn.cursor()
mpath = None
sql1 = cur.execute(
    "Delete from table1 where mdatetime < datetime.today() - interval '45 days'")
This causes the following error:
psycopg2.errors.InvalidSchemaName: schema "datetime" does not exist
LINE 1: Delete from logsearch_maillogs2 where mdatetime <
datetime.t...
How do I change the format or otherwise resolve this? Do I need to convert the value? I saw a few posts saying that a DateTime type doesn't exist in PostgreSQL, but I didn't find exact code to resolve this issue. Please guide.
The query runs in Postgres, not Python, so if you are writing a hard-coded string you need to use SQL timestamp functions, not Python ones. So datetime.today() becomes now(), per Current Date/Time in the Postgres docs.
sql1 = cur.execute(
    "Delete from table1 where mdatetime < now() - interval '45 days'")
Or, if you want a dynamic query, use query parameters (per the psycopg2 documentation on passing parameters) to pass in a Python datetime.
sql1 = cur.execute(
    "Delete from table1 where mdatetime < %s - interval '45 days'", [datetime.today()])

Multiple WHERE conditions in Pandas read_sql

I've got my data in an SQLite3 database, and now I'm trying to write a little script to access the data I want for given dates. I got the SELECT statement to work with the date ranges, but I can't seem to add another condition to fine-tune the search.
DB columns: id, date, driverid, drivername, pickupStop, pickupPkg, delStop, delPkg
What I've got so far:
import pandas as pd
import sqlite3
sql_data = 'driverperformance.sqlite'
conn = sqlite3.connect(sql_data)
cur = conn.cursor()
date_start = "2021-12-04"
date_end = "2021-12-10"
df = pd.read_sql_query("SELECT DISTINCT drivername FROM DriverPerf WHERE date BETWEEN :dstart and :dend", params={"dstart": date_start, "dend": date_end}, con=conn)
drivers = df.values.tolist()
for d in drivers:
    driverDF = pd.read_sql_query("SELECT * FROM DriverPerf WHERE drivername = :driver AND date BETWEEN :dstart and :dend", params={"driver": d, "dstart": date_start, "dend": date_end}, con=conn)
I've tried a few different versions of the "WHERE drivername" part but it always seems to fail.
Thanks!
If I'm not mistaken, drivers will be a list of lists. Have you tried
... params={"driver": d[0], ...
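For illustration, a sketch of the loop with that suggestion applied; the d[0] pulls the driver name string out of each single-element list:
import pandas as pd
import sqlite3

conn = sqlite3.connect('driverperformance.sqlite')
date_start = "2021-12-04"
date_end = "2021-12-10"

df = pd.read_sql_query(
    "SELECT DISTINCT drivername FROM DriverPerf WHERE date BETWEEN :dstart and :dend",
    params={"dstart": date_start, "dend": date_end}, con=conn)
drivers = df.values.tolist()  # e.g. [['Alice'], ['Bob'], ...] -- a list of single-element lists

for d in drivers:
    driverDF = pd.read_sql_query(
        "SELECT * FROM DriverPerf WHERE drivername = :driver AND date BETWEEN :dstart and :dend",
        params={"driver": d[0], "dstart": date_start, "dend": date_end}, con=conn)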

Pandas .to_sql fails silently randomly

I have several large pandas dataframes (about 30k+ rows) and need to upload a different version of them daily to a MS SQL Server db. I am trying to do so with the to_sql pandas function. On occasion, it will work. Other times, it will fail - silently - as if the code uploaded all of the data despite not having uploaded a single row.
Here is my code:
class SQLServerHandler(DataBaseHandler):
    ...
    def _getSQLAlchemyEngine(self):
        '''
        Get an sqlalchemy engine
        from the connection string
        The fast_executemany fails silently:
        https://stackoverflow.com/questions/48307008/pandas-to-sql-doesnt-insert-any-data-in-my-table/55406717
        '''
        # escape special characters as required by sqlalchemy
        dbParams = urllib.parse.quote_plus(self.connectionString)
        # create engine
        engine = sqlalchemy.create_engine(
            'mssql+pyodbc:///?odbc_connect={}'.format(dbParams))
        return engine

    #logExecutionTime('Time taken to upload dataframe:')
    def uploadData(self, tableName, dataBaseSchema, dataFrame):
        '''
        Upload a pandas dataFrame
        to a database table <tableName>
        '''
        engine = self._getSQLAlchemyEngine()
        dataFrame.to_sql(
            tableName,
            con=engine,
            index=False,
            if_exists='append',
            method='multi',
            chunksize=50,
            schema=dataBaseSchema)
Switching the method to None seems to work properly, but the data takes an insane amount of time to upload (30+ minutes). Having multiple tables (20 or so) of this size each day rules this solution out.
The proposed solution here of adding the schema as a parameter doesn't work. Neither does creating a sqlalchemy session and passing it to the con parameter with session.get_bind().
I am using:
ODBC Driver 17 for SQL Server
pandas 1.2.1
sqlalchemy 1.3.22
pyodbc 4.0.30
Does anyone know how to make it raise an exception if it fails?
Or why it is not uploading any data?
In rebuttal to this answer: if to_sql() were to fall victim to the issue described in
SQL Server does not finish execution of a large batch of SQL statements
then it would have to be constructing large anonymous code blocks of the form
-- Note no SET NOCOUNT ON;
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (0, 'row0');
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (1, 'row1');
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (2, 'row2');
…
and that is not what to_sql() is doing. If it were, then it would start to fail well below 1_000 rows, at least on SQL Server 2017 Express Edition:
import pandas as pd
import pyodbc
import sqlalchemy as sa
print(pyodbc.version) # 4.0.30
table_name = "gh_pyodbc_262"
num_rows = 400
print(f" num_rows: {num_rows}") # 400
cnxn = pyodbc.connect("DSN=mssqlLocal64", autocommit=True)
crsr = cnxn.cursor()
crsr.execute(f"TRUNCATE TABLE {table_name}")
sql = "".join(
[
f"INSERT INTO {table_name} ([id], [txt]) VALUES ({i}, 'row{i}');"
for i in range(num_rows)
]
)
crsr.execute(sql)
row_count = crsr.execute(f"SELECT COUNT(*) FROM {table_name}").fetchval()
print(f"row_count: {row_count}") # 316
Using to_sql() for that same operation works
import pandas as pd
import pyodbc
import sqlalchemy as sa
print(pyodbc.version) # 4.0.30
table_name = "gh_pyodbc_262"
num_rows = 400
print(f" num_rows: {num_rows}") # 400
df = pd.DataFrame(
    [(i, f"row{i}") for i in range(num_rows)], columns=["id", "txt"]
)
engine = sa.create_engine(
    "mssql+pyodbc://@mssqlLocal64", fast_executemany=True
)
df.to_sql(
    table_name,
    engine,
    index=False,
    if_exists="replace",
)
with engine.connect() as conn:
    row_count = conn.execute(
        sa.text(f"SELECT COUNT(*) FROM {table_name}")
    ).scalar()
print(f"row_count: {row_count}") # 400
and indeed will work for thousands and even millions of rows. (I did a successful test with 5_000_000 rows.)
OK, this seems to be an issue with SQL Server itself; see SQL Server does not finish execution of a large batch of SQL statements.
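As for the original question of making the failure visible: to_sql() returns None in pandas 1.2, so one pragmatic workaround is to compare row counts before and after the upload and raise if the table did not grow by len(dataFrame). A sketch of that check; the helper name is hypothetical and it assumes nothing else writes to the table during the upload:
import sqlalchemy

def upload_and_verify(dataFrame, tableName, dataBaseSchema, engine):
    '''
    Hypothetical helper: upload with to_sql, then compare row counts
    before and after so a silent failure raises instead of passing unnoticed.
    '''
    full_name = '{}.{}'.format(dataBaseSchema, tableName)
    count_sql = sqlalchemy.text('SELECT COUNT(*) FROM {}'.format(full_name))
    with engine.connect() as conn:
        before = conn.execute(count_sql).scalar()

    dataFrame.to_sql(tableName, con=engine, index=False,
                     if_exists='append', schema=dataBaseSchema)

    with engine.connect() as conn:
        after = conn.execute(count_sql).scalar()

    if after - before != len(dataFrame):
        raise RuntimeError('Expected {} new rows in {}, got {}'.format(
            len(dataFrame), full_name, after - before))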

How to fetch and save collection data quicker from MongoDB to a CSV file in Python?

I have a collection in MongoDB with 4,040,989 records.
I am fetching it and saving it to a CSV file in the following way:
import pandas as pd
import pymongo
connection_string=''
myclient = pymongo.MongoClient(connection_string)
mydb = myclient[""]
mycol = mydb[""]
x = mycol.find()
df = pd.DataFrame(list(x))
print(df)
df.to_csv('Db_collection.csv')
The above approach is taking a very long time (it has been running for 30 minutes and still hasn't finished).
Is there any built-in or custom way to complete this process faster with Python?
I am using Python 3.7, PyMongo 3.10.1, and Windows 10 with a GPU.
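One thing worth noting: most of the time here is likely spent materializing all four million documents in memory with list(x) before building the DataFrame. A sketch of streaming the cursor straight to CSV instead; the database, collection, field names, and batch size below are placeholders:
import csv
import pymongo

connection_string = ''
myclient = pymongo.MongoClient(connection_string)
mycol = myclient["mydb"]["mycol"]  # placeholder database / collection names

fields = ["_id", "field1", "field2"]  # placeholder column names

with open('Db_collection.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    # Stream documents straight from the cursor instead of list()-ing them all.
    for doc in mycol.find({}, projection=fields).batch_size(10000):
        writer.writerow(doc)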

Need a better approach to load Oracle BLOB data into a MongoDB collection using GridFS

Recently I started working on a new project where I need to transfer Oracle table data into MongoDB collections.
The Oracle table contains one column of BLOB datatype.
I wanted to transfer the Oracle BLOB data into MongoDB using GridFS, and I even succeeded, but I am unable to scale it up.
If I use the same script for 10k or 50k records, it takes a very, very long time.
Please suggest where I can improve, or whether there is a better way to achieve my goal.
Thank you in advance.
Below is the sample code I am using to load a small amount of data:
from pymongo import MongoClient
import cx_Oracle
from gridfs import GridFS
import pickle
import sys

client = MongoClient('localhost:27017/sample')
dbm = client.sample
db = <--oracle connection----->
cursor = db.cursor()

def get_notes_file_sys():
    return GridFS(dbm, 'notes')

def save_data_in_file(fs, note, file_name):
    gridin = None
    file_ids = {}
    data_blob = pickle.dumps(note['file_content_blob'])
    del note['file_content_blob']
    gridin = fs.open_upload_stream(file_name, chunk_size_bytes=261120, metadata=note)
    gridin.write(data_blob)
    gridin.close()
    file_ids['note_id'] = gridin._id
    return file_ids

# ---------------------------Uploading files start---------------------------------------
fs = get_notes_file_sys()
query = ("""SELECT id, file_name, file_content_blob, author, created_at FROM notes fetch next 10 rows only""")
cursor.execute(query)
rows = cursor.fetchall()
col = [co[0] for co in cursor.description]
final_arr = []
for row in rows:
    data = dict(zip(col, row))
    file_name = data['file_name']
    if data["file_content_blob"] is None:
        data["file_content_blob"] = None
    else:
        # This below line is taking more time
        data["file_content_blob"] = data["file_content_blob"].read()
    note_id = save_data_in_file(fs, data, file_name)
    data['note_id'] = note_id
    final_arr.append(data)
dbm['notes'].insert_many(final_arr)  # pymongo collections use insert_many, not bulk_insert
Two things come to mind:
Don't move to Mongo. Just use Oracle's SODA document storage model: https://cx-oracle.readthedocs.io/en/latest/user_guide/soda.html Also take a look at Oracle's JSON DB service: https://blogs.oracle.com/jsondb/autonomous-json-database
Fetch BLOBs as bytes, which is much faster than the method you are using; see https://cx-oracle.readthedocs.io/en/latest/user_guide/lob_data.html#fetching-lobs-as-strings-and-bytes and the example at https://github.com/oracle/python-cx_Oracle/blob/master/samples/ReturnLobsAsStrings.py
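For the second suggestion, a sketch of what fetching the BLOB column directly as bytes could look like with a cx_Oracle output type handler, per the linked docs. The connection details are placeholders, and it assumes cx_Oracle 8+ with BLOBs small enough to fit in memory:
import cx_Oracle

def output_type_handler(cursor, name, default_type, size, precision, scale):
    # Return BLOB columns as raw bytes instead of LOB locators, avoiding
    # an extra round trip per row to call .read() on each LOB.
    if default_type == cx_Oracle.DB_TYPE_BLOB:
        return cursor.var(cx_Oracle.DB_TYPE_LONG_RAW, arraysize=cursor.arraysize)

db = cx_Oracle.connect(user="user", password="password", dsn="localhost/orclpdb1")  # placeholders
db.outputtypehandler = output_type_handler
cursor = db.cursor()
cursor.execute("SELECT id, file_name, file_content_blob, author, created_at FROM notes fetch next 10 rows only")
rows = cursor.fetchall()  # file_content_blob now arrives as bytes; no per-row .read() needed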
