Loop through columns - SQLAlchemy Core - python-3.x

I am trying to loop through the columns of all the tables in my database to select empty columns. I finally used raw sql and .format to get it to work, but how do I use SQLAlchemy to achieve the same result? Here is the code I've written:
from sqlalchemy import MetaData, create_engine, select
from sqlalchemy.sql import func

engine = create_engine('...')
conn = engine.connect()
tablemeta = MetaData(bind=engine, reflect=True)

for t in tablemeta.sorted_tables:
    for col in t.c:
        s = select([func.count(t.c[str(col)].distinct())])
        s = s.scalar()
        if s <= 1:
            print(s)
But this results in a KeyError.

OK I got it to work. str(col) returns the qualified name (e.g. "table.column"), which is not a key in t.c, whereas col.name is:
for t in tablemeta.sorted_tables:
    for col in t.c:
        s = select([func.count(t.c[col.name].distinct())])
        s = s.scalar()
        if s <= 1:
            print(s)
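As a side note, a minimal sketch of the same loop that skips the key lookup entirely by passing the column object straight to func.count(), assuming the same bound-metadata setup as above so .scalar() can execute:

for t in tablemeta.sorted_tables:
    for col in t.c:
        # count distinct values directly on the column object; no name lookup needed
        distinct_count = select([func.count(col.distinct())]).scalar()
        if distinct_count <= 1:
            print(t.name, col.name, distinct_count)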

Related

Multiple WHERE conditions in Pandas read_sql

I've got my data in an SQLite3 database, and now I'm trying to work on a little script to access the data I want for given dates. I got the SELECT statement to work with date ranges, but I can't seem to add another condition to fine-tune the search.
DB columns: id, date, driverid, drivername, pickupStop, pickupPkg, delStop, delPkg
What I've got so far:
import pandas as pd
import sqlite3
sql_data = 'driverperformance.sqlite'
conn = sqlite3.connect(sql_data)
cur = conn.cursor()
date_start = "2021-12-04"
date_end = "2021-12-10"
df = pd.read_sql_query("SELECT DISTINCT drivername FROM DriverPerf WHERE date BETWEEN :dstart and :dend", params={"dstart": date_start, "dend": date_end}, con=conn)
drivers = df.values.tolist()
for d in drivers:
    driverDF = pd.read_sql_query("SELECT * FROM DriverPerf WHERE drivername = :driver AND date BETWEEN :dstart and :dend", params={"driver": d, "dstart": date_start, "dend": date_end}, con=conn)
I've tried a few different versions of the "WHERE drivername" part but it always seems to fail.
Thanks!
If I'm not mistaken, drivers will be a list of lists, so each d is itself a one-element list. Have you tried
.... params={"driver": d[0] ....
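A minimal sketch of that suggestion in context, assuming the same table and column names as above:

for d in drivers:
    # d is a one-element list like ['Smith'], so pass d[0] as the parameter value
    driverDF = pd.read_sql_query(
        "SELECT * FROM DriverPerf WHERE drivername = :driver AND date BETWEEN :dstart AND :dend",
        params={"driver": d[0], "dstart": date_start, "dend": date_end},
        con=conn,
    )

Alternatively, drivers = df['drivername'].tolist() would give a flat list of names, so d could be used directly.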

sqlAlchemy Row Count from Results

Having recently upgraded SQLAlchemy and Python to 3.8, this code no longer works to get a row count from search results via the SQLAlchemy ORM. It seems the use of _saved_cursor._result.rows has been deprecated. (Error: AttributeError: 'LegacyCursorResult' object has no attribute '_saved_cursor')
def get_clients(db, status):
    clients = Table("clients", db.metadata, autoload=True)
    qry = clients.select().where(clients.c.status == status)
    res = qry.execute()
    rowcount = len(res._saved_cursor._result.rows)
    return rowcount
We have this very ugly code that works, but it has to loop through all the results just to get the count.
def get_clients(db, status):
    clients = Table("clients", db.metadata, autoload=True)
    qry = clients.select().where(clients.c.status == status)
    res = qry.execute()
    rowcount = 0
    for row in res:
        rowcount += 1
    return rowcount
Without using raw SQL, what is the most efficient way to get the row count using the SQLAlchemy ORM?
The solution is to use func.count() from SQLAlchemy and fetch the result as a scalar:
from sqlalchemy import Table, func, select

def get_clients(db, status):
    clients = Table("clients", db.metadata, autoload=True)
    qry = select([func.count()]).select_from(clients).where(clients.c.status == status)
    row_count = qry.execute().scalar()
    return row_count
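Note that qry.execute() relies on bound metadata, which is deprecated in SQLAlchemy 1.4 and removed in 2.0. A minimal sketch of the same count in the 1.4+ style, assuming an engine object is available instead of bound metadata:

from sqlalchemy import MetaData, Table, func, select

def get_clients(engine, status):
    # reflect the table against the engine instead of relying on bound metadata
    clients = Table("clients", MetaData(), autoload_with=engine)
    stmt = select(func.count()).select_from(clients).where(clients.c.status == status)
    with engine.connect() as conn:
        return conn.execute(stmt).scalar()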

How to apply multiple WHERE clauses in SQLAlchemy in Dask while fetching a large dataset from Teradata

I am trying to fetch a large dataset from Teradata using Dask and SQLAlchemy. I am able to apply a single WHERE clause and fetch data. Below is the working code:
td_engine = create_engine(connString)
metadata = MetaData()
t = Table(
    "table",
    metadata,
    Column("c1"),
    schema="schema",
)
sql = select([t]).where(
    t.c.c1 == 'abc',
)
start = perf_counter()
df = dd.read_sql_table(sql, connString, index_col="c1", schema="schema")
end = perf_counter()
print("Time taken to execute the code {}".format(end - start))
print(df.head())
but when I try to apply and_() in the WHERE clause I get an error:
sql = select([t]).where(
    and_(
        t.c.c1 == 'abc',
        t.c.c2 == 'xyz'
    )
)
More context would be helpful. If you simply need to execute the query, have you considered using the pandas read_sql function and composing the SQL request yourself?
import teradatasql
import pandas as pd

with teradatasql.connect(host="whomooz", user="guest", password="please") as con:
    df = pd.read_sql("select c1 from mytable where c1='abc' and c2='xyz'", con)
    print(df.head())
Or is there a specific need to use the pandas functions to construct the SQL request?
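If the goal is to keep Dask in the picture, a sketch that passes the full SQLAlchemy expression (including and_) to Dask is shown below. This assumes a recent Dask version that provides dask.dataframe.read_sql_query (which accepts a SQLAlchemy Select rather than a table name) and reuses the t table object and connString from the question; it is not a tested Teradata setup:

import dask.dataframe as dd
from sqlalchemy import and_, select

# build the two-condition filter on the table object `t` defined above
sql = select([t]).where(
    and_(
        t.c.c1 == 'abc',
        t.c.c2 == 'xyz',
    )
)

# read_sql_query takes a SQLAlchemy Select, so the combined WHERE clause
# travels with the query; index_col is used by Dask for partitioning
df = dd.read_sql_query(sql, connString, index_col="c1")
print(df.head())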

Is there a Python module to read from Oracle and split into multiple Excel files with less memory usage based on a column?

I am trying to split an Oracle table based on values in a column (hospital names). The data set is ~3 million rows across 66 columns. I'm trying to write data for one hospital from 3 different tables into 1 Excel workbook with 3 different sheets.
I have running code which worked for ~700K rows, but the new set is too large and I run into memory problems. I tried to modify my code to hit the database once per hospital name using a for loop, but I get an xlsx error about requiring an explicit close.
import cx_Oracle
import getpass
import xlsxwriter
import pandas as pd

path = "C:\HN\1"
p = getpass.getpass()

# Connecting to Oracle
myusername = 'CN138609'
dsn_tns = cx_Oracle.makedsn('oflc1exa03p-vip.centene.com', '1521', service_name='IGX_APP_P')
conn = cx_Oracle.connect(user=myusername, password=p, dsn=dsn_tns)

sql_4 = "select distinct hospital_name from HN_Hosp_Records"
df4 = pd.read_sql(sql_4, conn)
hospital_name = list(df4['HOSPITAL_NAME'])

for x in hospital_name:
    hosp_name = {"hosp": x}
    sql_1 = "select * from HN_Hosp_Records where hospital_name = :hosp"
    sql_2 = "select * from HN_CAP_Claims_Not_In_DHCS where hospital_name = :hosp"
    sql_3 = "select * from HN_Denied_Claims where hospital_name = :hosp"
    df1 = pd.read_sql(sql_1, conn, params=hosp_name)
    df2 = pd.read_sql(sql_2, conn, params=hosp_name)
    df3 = pd.read_sql(sql_3, conn, params=hosp_name)
    df_dhcs = df1.loc[df1['HOSPITAL_NAME'] == x]
    df_dw = df2.loc[df2['HOSPITAL_NAME'] == x]
    df_denied = df3.loc[df3['HOSPITAL_NAME'] == x]
    # Create a new excel workbook
    writer = pd.ExcelWriter(path + x + "_HNT_P2_REC_05062019.xlsx", engine='xlsxwriter')
    # Write each dataframe to a different worksheet.
    df_dhcs.to_excel(writer, sheet_name="DHCS")
    df_dw.to_excel(writer, sheet_name="Not In DHCS")
    df_denied.to_excel(writer, sheet_name="Denied")
    writer.close()
Here is the warning/error I'm getting. The code doesn't stop but no file is being output:
File "C:\ProgramData\Anaconda3\lib\site-packages\xlsxwriter\workbook.py", line 153, in del
raise Exception("Exception caught in workbook destructor. "
Exception: Exception caught in workbook destructor. Explicit close() may be required for workbook.
I solved it. Instead of a bind variable, using %s was the trick.
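A minimal sketch of one possible reading of that fix for a single query (a hypothetical rewrite, not the poster's exact code; note that %s string formatting bypasses bind parameters, so it should only be used with trusted values):

for x in hospital_name:
    # interpolate the hospital name directly instead of passing a bind variable
    sql_1 = "select * from HN_Hosp_Records where hospital_name = '%s'" % x
    df1 = pd.read_sql(sql_1, conn)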

Querying from Microsoft SQL to a Pandas Dataframe

I am trying to write a program in Python3 that will run a query on a table in Microsoft SQL and put the results into a Pandas DataFrame.
My first try was the code below, but for some reason I don't understand, the columns do not come back in the order I listed them in the query, and the order they appear in (and therefore the labels they are given) changes from run to run, stuffing up the rest of my program:
import pandas as pd, pyodbc

result_port_mapl = []

# Use pyodbc to connect to SQL Database
con_string = 'DRIVER={SQL Server};SERVER=' + <server> + ';DATABASE=' + <database>
cnxn = pyodbc.connect(con_string)
cursor = cnxn.cursor()

# Run SQL Query
cursor.execute("""
SELECT <field1>, <field2>, <field3>
FROM result
""")

# Put data into a list
for row in cursor.fetchall():
    temp_list = [row[2], row[1], row[0]]
    result_port_mapl.append(temp_list)

# Make list of results into dataframe with column names
## FOR SOME REASON HERE row[1] AND row[0] DO NOT CONSISTENTLY APPEAR IN THE
## SAME ORDER AND SO THEY ARE MISLABELLED
result_port_map = pd.DataFrame(result_port_mapl, columns={'<field1>', '<field2>', '<field3>'})
I have also tried the following code:
import pandas as pd, pyodbc

# Use pyodbc to connect to SQL Database
con_string = 'DRIVER={SQL Server};SERVER=' + <server> + ';DATABASE=' + <database>
cnxn = pyodbc.connect(con_string)
cursor = cnxn.cursor()

# Run SQL Query
cursor.execute("""
SELECT <field1>, <field2>, <field3>
FROM result
""")

# Put data into DataFrame
# This becomes one column with a list in it with the three columns
# divided by a comma
result_port_map = pd.DataFrame(cursor.fetchall())

# Get column headers
# This gives the error "AttributeError: 'pyodbc.Cursor' object has no
# attribute 'keys'"
result_port_map.columns = cursor.keys()
If anyone could suggest why either of those errors are happening or provide a more efficient way to do it, it would be greatly appreciated.
Thanks
What if you just use read_sql? Like:
import pandas as pd, pyodbc
con_string = 'DRIVER={SQL Server};SERVER='+ <server> +';DATABASE=' + <database>
cnxn = pyodbc.connect(con_string)
query = """
SELECT <field1>, <field2>, <field3>
FROM result
"""
result_port_map = pd.read_sql(query, cnxn)
result_port_map.columns.tolist()
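As a side note on the original symptom: passing a Python set ({'<field1>', ...}) as the columns argument gives an arbitrary, run-to-run order, which would explain the shifting labels; a list keeps a fixed order. A minimal sketch using the placeholder field names from the question:

# a list (not a set) keeps labels in a fixed order; order it to match how
# temp_list was built, i.e. [row[2], row[1], row[0]] -> field3, field2, field1
result_port_map = pd.DataFrame(result_port_mapl, columns=['<field3>', '<field2>', '<field1>'])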
