Hi I am pretty new to python I have a code that reads mysql query thorugh pandas and if there is data converts it into the csv now I need to move a step ahead and add another query and create another csv I am not sure what will be the best way to do it. Any help is appreciated thanks.
My code is something like this
def data_to_df(connection):
query = """
select * from abs
"""
data = pd.read_sql(sql=query,con=connection)
return data
def main():
# DB connection and data retrieval
cnx = database_connection(db_credentials)
print(cnx)
exit()
df = data_to_df(cnx)
Convert Dataframe to textfile
df.to_csv(file_location, sep = '|', na_rep = 'NULL', index=False, quoting=csv.QUOTE_NONE)
print('File created successfully! \n')
how can I add another query that will be executed and will create a different file altogether
Related
#here I have to apply the loop which can provide me the queries from excel for respective reports:
df1 = pd.read_sql(SQLqueryB2, con=con1)
df2 = pd.read_sql(ORCqueryC2, con=con2)
if (df1.equals(df2)):
print(Report2 +" : is Pass")
Can we achieve above by something doing like this (by iterating ndarray)
df = pd.read_excel(path) for col, item in df.iteritems():
OR do the only option left to read the excel from "openpyxl" library and iterate row, columns and then provide the values. Hope I am clear with the question, if any doubt please comment me.
You are trying to loop through an excel file, run the 2 queries, see if they match and output the result, correct?
import pandas as pd
from sqlalchemy import create_engine
# add user, pass, database name
con = create_engine(f"mysql+pymysql://{USER}:{PWD}#{HOST}/{DB}")
file = pd.read_excel('excel_file.xlsx')
file['Result'] = '' # placeholder
for i, row in file.iterrows():
df1 = pd.read_sql(row['SQLQuery'], con)
df2 = pd.read_sql(row['Oracle Queries'], con)
file.loc[i, 'Result'] = 'Pass' if df1.equals(df2) else 'Fail'
file.to_excel('results.xlsx', index=False)
This will save a file named results.xlsx that mirrors the original data but adds a column named Result that will be Pass or Fail.
Example results.xlsx:
I am trying to write a pipeline to bring oracle database table data to aws.
It only takes a few ms to fill the dataframe, but when I try to write the dataframe to a csv-file it takes more than 2 min to write 10000 rows. In addition, one of the column's datatypes is cx_oracle lob type.
I thought this meant that it must take time to write data. So I converted the data to categorical data. But then the operation will take more memory space. Does anyone have any suggestions on how to optimize this process?
query = 'select * from tablename'
cursor.execute(query)
iter_idx = 0
while True:
results = cursor.fetchmany()
if not results:
break
iter_idx += 1
df = pd.DataFrame(results)
df.columns = field['source_field_names']
rec_count = df.shape[0]
t_rec_count += rec_count
file = generate_micro_file()
print('memory usage : \n', df.info(memory_usage='deep'))
# sd = dd.from_pandas(df, npartitions=1)
df.to_csv(file, encoding=str(encoding_type), header=False, index=False, escapechar='\\',chunksize=arraysize)
code output:
From the data access side, there is room for improvement by optimizing the fetching of rows across the network. Either by:
passing a large num_rows value to fetchmany(), see the cx_Oracle doc on [Cursor.fetchmany()[(https://cx-oracle.readthedocs.io/en/latest/api_manual/cursor.html#Cursor.fetchmany).
or increasing the value of Cursor.arraysize.
Your question didn't explain enough about your LOB usage. See the sample return_lobs_as_strings.py for optimizing fetches.
See the cx_Oracle documentation Tuning Fetch Performance.
Is there a particular reason to spend the overhead of converting to a Pandas dataframe? Why not write directly using the csv module?
Maybe something like this:
with connection.cursor() as cursor:
sql = "select * from all_objects where rownum <= 100000"
cursor.arraysize = 10000
with open("testwrite.csv", "w", encoding="utf-8") as outputfile:
writer = csv.writer(outputfile, lineterminator="\n")
results = cursor.execute(sql)
writer.writerows(results)
You should benchmark and choose the best solution.
I made a function that inserts .CSV data into BigQuery in every 5~6 seconds. I've been looking for the way to avoid duplicating the data in BigQuery after inserting. I want to remove data that has same luid but I have no idea how to remove it so is it possible to check each data of .CSV has already existed in BigQuery table before inserting .
I put row_ids parameter to avoid duplicate luid but it seems not to work well .
Could you give me any idea ?? Thanks.
def stream_upload():
# BigQuery
client = bigquery.Client()
project_id = 'test'
dataset_name = 'test'
table_name = "test"
full_table_name = dataset_name + '.' + table_name
json_rows = []
with open('./test.csv','r') as f:
for line in csv.DictReader(f):
del line[None]
line_json = dict(line)
json_rows.append(line_json)
errors = client.insert_rows_json(
full_table_name,json_rows,row_ids=[row['luid'] for row in json_rows]
)
if errors == []:
print("New rows have been added.")
else:
print("Encountered errors while inserting rows: {}".format(errors))
print("end")
schedule.every(0.5).seconds.do(stream_upload)
while True:
schedule.run_pending()
time.sleep(0.1)
BigQuery doesn't have a native way to deal with this. You could either create a view off of this table that performs deduping or create an external cache of luids and lookup if they have already been written to BigQuery before writing and update the cache after writing new data. This could be as simple as a file cache or you could use an additional database.
I am trying to read a large table (10-15M rows) from a database into pandas dataframe and I'm using the following code:
def read_sql_tmpfile(query, db_engine):
with tempfile.TemporaryFile() as tmpfile:
copy_sql = "COPY ({query}) TO STDOUT WITH CSV {head}".format(
query=query, head="HEADER"
)
conn = db_engine.raw_connection()
cur = conn.cursor()
cur.copy_expert(copy_sql, tmpfile)
tmpfile.seek(0)
df = pandas.read_csv(tmpfile)
return df
I can use this if I have a simple query like this and I pass this into above func:
'''SELECT * from hourly_data'''
But what if I want to pass some variable into this query i.e.
'''SELECT * from hourly_data where starttime >= %s '''
Now where do I pass the parameter?
You cannot use parameters with COPY. Unfortunately that extends to the query you use inside COPY, even if you could use parameters with the query itself.
You will have to construct a query string including the parameter (beware of SQL injection) and use that with COPY.
So I have been trying to create a dataframe from a mysql database using pandas and python but I have encountered an issue which I need help on.
The issue is when writing the dataframe to excel, it only writes the last row ie, it overwrites all the previous entries and only the last row is written. Please see the code below
import pandas as pd
import numpy
import csv
with open('C:path_to_file\\extract_job_details.csv', 'r') as f:
reader = csv.reader(f)
for row in reader:
jobid = str(row[1])
statement = """select jt.job_id ,jt.vendor_data_type,jt.id as TaskId,jt.create_time as CreatedTime,jt.job_start_time as StartedTime,jt.job_completion_time,jt.worker_path, j.id as JobId from dspe.job_task jt JOIN dspe.job j on jt.job_id = j.id where jt.job_id = %(jobid)s"""",
df_mysql = pd.read_sql(statement1, con=mysql_cn)
try:
with pd.ExcelWriter(timestr+'testResult.xlsx', engine='xlsxwriter') as writer:
df_mysql.to_excel(writer, sheet_name='Sheet1')
except pymysql.err.OperationalError as error:
code, message = error.args
mysql_cn.close()
Please can anyone help me identify where I am going wrong?
PS i am a new to pandas and python.
Thanks Carlos
I'm not really sure what you're trying to do reading from disk and a database at the same time...
First, you don't need csv when you're already using Pandas:
df = pd.read_csv("path/to/input/csv")
Next you can simply provide a file path as an argument to to_excel instead of an ExcelWriter instance:
df.to_excel("path/to/desired/excel/file")
If it doesn't actually need to be an excel file you can use:
df.to_csv("path/to/desired/csv/file")