An optimized way to iterate over an Excel file and pass its queries to pd.read_sql() as strings, one by one - python-3.x

# Here I need a loop that supplies the queries from Excel for the respective reports:
df1 = pd.read_sql(SQLqueryB2, con=con1)
df2 = pd.read_sql(ORCqueryC2, con=con2)
if df1.equals(df2):
    print(Report2 + " : is Pass")
Can we achieve the above by doing something like this (iterating over the columns):
df = pd.read_excel(path)
for col, item in df.iteritems():
Or is the only option left to read the Excel file with the "openpyxl" library, iterate over its rows and columns, and then pass the values along? I hope the question is clear; if anything is in doubt, please leave a comment.

You are trying to loop through an excel file, run the 2 queries, see if they match and output the result, correct?
import pandas as pd
from sqlalchemy import create_engine
# add user, pass, database name
con = create_engine(f"mysql+pymysql://{USER}:{PWD}@{HOST}/{DB}")
file = pd.read_excel('excel_file.xlsx')
file['Result'] = '' # placeholder
for i, row in file.iterrows():
    df1 = pd.read_sql(row['SQLQuery'], con)
    df2 = pd.read_sql(row['Oracle Queries'], con)
    file.loc[i, 'Result'] = 'Pass' if df1.equals(df2) else 'Fail'
file.to_excel('results.xlsx', index=False)
This will save a file named results.xlsx that mirrors the original data but adds a column named Result that will be Pass or Fail.
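The question uses two separate connections (con1 and con2), one per database. Here is a minimal sketch of the same loop with one connection per query column. Two in-memory SQLite databases and a hand-built DataFrame stand in for the real servers and the Excel sheet; the table t and its contents are made up for illustration.

```python
import sqlite3

import pandas as pd

# Two in-memory SQLite databases stand in for the two real servers (assumption).
con1 = sqlite3.connect(":memory:")
con2 = sqlite3.connect(":memory:")
for con in (con1, con2):
    con.execute("CREATE TABLE t (id INTEGER, name TEXT)")
    con.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
    con.commit()

# Stand-in for pd.read_excel(...): one row per report, one column per query.
queries = pd.DataFrame({
    "Report": ["Report1", "Report2"],
    "SQLQuery": ["SELECT id, name FROM t ORDER BY id",
                 "SELECT id FROM t ORDER BY id"],
    "Oracle Queries": ["SELECT id, name FROM t ORDER BY id",
                       "SELECT name FROM t ORDER BY id"],
})

results = []
for sql_query, ora_query in zip(queries["SQLQuery"], queries["Oracle Queries"]):
    df1 = pd.read_sql(sql_query, con1)  # first database
    df2 = pd.read_sql(ora_query, con2)  # second database
    results.append("Pass" if df1.equals(df2) else "Fail")

queries["Result"] = results
print(queries[["Report", "Result"]])
```

Keep in mind that DataFrame.equals is sensitive to row order, column names and dtypes, so both queries should use a deterministic ORDER BY.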
Example results.xlsx:


Remove columns starting with same special character in a csv file using Python

My csv file has below columns:
AFM_reversal_indicator,Alert_Message,axiom_key,_timediff,player,__mv_Splunk_Alert_Id,__mv_nbr_plastic,__mv_code_acct_stat_demo.
I want to remove columns starting with "__mv".
I saw some posts where pandas is used to filter out columns.
Is it possible to do this using the csv module in Python? If yes, how?
Also, with pandas, what regex should I give:
df.filter(regex='')
df.to_csv(output_file_path)
P.S I am using python3.8
You mean with standard python? You can use a list comprehension, e.g.
import csv
with open( 'data.csv', 'r' ) as f:
    DataGenerator = csv.reader( f )
    Header = next( DataGenerator )
    Header = [ Col.strip() for Col in Header ]
    Data = list( DataGenerator )
    if Data[-1] == []: del( Data[-1] )
Data = [ [Row[i] for i in range( len( Header ) ) if not Header[i].startswith( "__mv" ) ] for Row in Data ]
Header = [ Col for Col in Header if not Col.startswith( "__mv" ) ]
However, this is just a simple example. You'll probably have further things to consider, e.g. what type your csv columns have, whether you want to read all the data at once like I do here, or one-by-one from the generator to save on memory, etc.
You could also use the built-in filter function instead of the inner list comprehension.
Also, if you have numpy installed and you wanted something more 'numerical', you can always use "structured numpy arrays" (https://numpy.org/doc/stable/user/basics.rec.html). They're quite nice. (personally I prefer them to pandas anyway). Numpy also has its own csv-reading functions (see: https://www.geeksforgeeks.org/how-to-read-csv-files-with-numpy/)
You don't need to use .filter for that. You can just find out which are those columns and then drop them from the DataFrame
import pandas as pd
# Load the dataframe (In our case create a dummy one with your columns)
df = pd.DataFrame(columns = ["AFM_reversal_indicator", "Alert_Message", "axiom_key", "_timediff", "player", "__mv_Splunk_Alert_Id", "__mv_nbr_plastic", "__mv_code_acct_stat_demo"])
# Get a list of all column names starting with "__mv"
mv_columns = [col for col in df.columns if col.startswith("__mv")]
# Drop the columns
df = df.drop(columns=mv_columns)
# Save the updated dataframe to a CSV file
df.to_csv("cleaned_data.csv", index=False)
The mv_columns list comprehension iterates over the columns of your DataFrame and picks those that start with "__mv". Then .drop simply removes them.
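The question also asked what regex to pass to df.filter. Here is a small sketch using the column names from the question (the row values are made up): a negative lookahead keeps every label that does not begin with "__mv", and a boolean mask on the column labels gives the same result without a regex.

```python
import pandas as pd

# Illustrative one-row frame using the column names from the question.
df = pd.DataFrame(
    [[0, "msg", "k", 5, "p", 1, 2, 3]],
    columns=["AFM_reversal_indicator", "Alert_Message", "axiom_key", "_timediff",
             "player", "__mv_Splunk_Alert_Id", "__mv_nbr_plastic",
             "__mv_code_acct_stat_demo"],
)

# Negative lookahead: keep labels that do NOT begin with "__mv".
kept_by_regex = df.filter(regex=r"^(?!__mv)")

# Same selection with a boolean mask on the column labels, no regex needed.
kept_by_mask = df.loc[:, ~df.columns.str.startswith("__mv")]

print(list(kept_by_regex.columns))
```

Note that "_timediff" survives because it starts with a single underscore, not with "__mv".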
If for some reason you want to use csv package only, then the solution might not be as elegant as with pandas. But here is a suggestion:
import csv
with open("original_data.csv", "r") as input_file, open("cleaned_data.csv", "w", newline="") as output_file:
    reader = csv.reader(input_file)
    writer = csv.writer(output_file)
    header_row = next(reader)
    mv_columns = [col for col in header_row if col.startswith("__mv")]
    mv_column_indices = [header_row.index(col) for col in mv_columns]
    new_header_row = [col for col in header_row if col not in mv_columns]
    writer.writerow(new_header_row)
    for row in reader:
        new_row = [row[i] for i in range(len(row)) if i not in mv_column_indices]
        writer.writerow(new_row)
So, first, you read the first row, which is supposed to hold your headers. With similar logic as before, you find the columns that start with "__mv" and get their indices. You write the new header to your output file, keeping only the columns that are not among the "__mv" columns. Then you iterate through the rest of the CSV and drop those columns from each row as you go.
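If tracking indices feels fiddly, the csv module's DictReader and DictWriter can do the bookkeeping by name. This is only a sketch: in-memory io.StringIO buffers with made-up column names stand in for the real input and output files.

```python
import csv
import io

# In-memory stand-ins for the input and output files (illustrative data).
source = io.StringIO("a,__mv_x,b,__mv_y\n1,2,3,4\n5,6,7,8\n")
target = io.StringIO()

reader = csv.DictReader(source)
# Keep only the header names that do not start with "__mv".
kept = [name for name in reader.fieldnames if not name.startswith("__mv")]

# extrasaction="ignore" silently drops the keys that are not in fieldnames.
writer = csv.DictWriter(target, fieldnames=kept, extrasaction="ignore",
                        lineterminator="\n")
writer.writeheader()
for row in reader:
    writer.writerow(row)

print(target.getvalue())
```

With real files, open both with newline="" as in the example above and pass the file objects in place of the buffers.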

pandas data types changed when reading from parquet file?

I am brand new to pandas and the parquet file type. I have a python script that:
1. reads in a hdfs parquet file
2. converts it to a pandas dataframe
3. loops through specific columns and changes some values
4. writes the dataframe back to a parquet file
Then the parquet file is imported back into hdfs using impala-shell.
The issue I'm having appears to be with step 2. I have it print out the contents of the dataframe immediately after it reads it in and before any changes are made in step 3. It appears to be changing the datatypes and the data of some fields, which causes problems when it writes it back to a parquet file. Examples:
fields that show up as NULL in the database are replaced with the string "None" (for string columns) or the string "nan" (for numeric columns) in the printout of the dataframe.
fields that should be an Int with a value of 0 in the database are changed to "0.00000" and turned into a float in the dataframe.
It appears that it is actually changing these values, because when it writes the parquet file and I import it into hdfs and run a query, I get errors like this:
WARNINGS: File '<path>/test.parquet' has an incompatible Parquet schema for column
'<database>.<table>.tport'. Column type: INT, Parquet schema:
optional double tport [i:1 d:1 r:0]
I don't know why it would alter the data and not just leave it as-is. If this is what's happening, I don't know if I need to loop over every column and replace all these back to their original values, or if there is some other way to tell it to leave them alone.
I have been using this reference page:
http://arrow.apache.org/docs/python/parquet.html
It uses
pq.read_table(in_file)
to read the parquet file and then
df = table2.to_pandas()
to convert to a dataframe that I can loop through and change the columns. I don't understand why it's changing the data, and I can't find a way to prevent this from happening. Is there a different way I need to read it than read_table?
If I query the database, the data would look like this:
tport
0
1
My print(df) line for the same thing looks like this:
tport
0.00000
nan
nan
1.00000
Here is the relevant code. I left out the part that processes the command-line arguments since it was long and it doesn't apply to this problem. The file passed in is in_file:
import sys, getopt
import random
import re
import math
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow as pa
import os.path
# <CLI PROCESSING SECTION HERE>
# GET LIST OF COLUMNS THAT MUST BE SCRAMBLED
field_file = open('scrambler_columns.txt', 'r')
contents = field_file.read()
scrambler_columns = contents.split('\n')
def scramble_str(xstr):
    #print(xstr + '_scrambled!')
    return xstr + '_scrambled!'
parquet_file = pq.ParquetFile(in_file)
table2 = pq.read_table(in_file)
metadata = pq.read_metadata(in_file)
df = table2.to_pandas() #dataframe
print('rows: ' + str(df.shape[0]))
print('cols: ' + str(df.shape[1]))
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.float_format', lambda x: '%.5f' % x)
#df.fillna(value='', inplace=True) # np.nan # \xa0
print(df) # print before making any changes
cols = list(df)
# https://pythonbasics.org/pandas-iterate-dataframe/
for col_name, col_data in df.iteritems():
    #print(cols[index])
    if col_name in scrambler_columns:
        print('scrambling values in column ' + col_name)
        for i, val in col_data.items():
            df.at[i, col_name] = scramble_str(str(val))
print(df) # print after making changes
print(parquet_file.num_row_groups)
print(parquet_file.read_row_group(0))
# WRITE NEW PARQUET FILE
new_table = pa.Table.from_pandas(df)
writer = pq.ParquetWriter(out_file, new_table.schema)
for i in range(1):
    writer.write_table(new_table)
writer.close()
if os.path.isfile(out_file) == True:
    print('wrote ' + out_file)
else:
    print('error writing file ' + out_file)
# READ NEW PARQUET FILE
table3 = pq.read_table(out_file)
df = table3.to_pandas() #dataframe
print(df)
EDIT
Here are the datatypes for the 1st few columns in hdfs
and here are the same ones that are in the pandas dataframe:
id object
col1 float64
col2 object
col3 object
col4 float64
col5 object
col6 object
col7 object
It appears to convert
String to object
Int to float64
bigint to float64
How can I tell pandas what data types the columns should be?
Edit 2: I was able to find a workaround by directly processing the pyarrow tables. Please see my question and answers here: How to update data in pyarrow table?
fields that show up as NULL in the database are replaced with the string "None" (for string columns) or the string "nan" (for numeric columns) in the printout of the dataframe.
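This is expected. The values are not actually strings: pandas represents a null as None in an object column and as NaN in a float column, and that is simply how they are printed.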
It appears to convert String to object
This is also expected. Numpy/pandas does not have a dtype for variable length strings. It's possible to use a fixed-length string type but that would be pretty unusual.
It appears to convert Int to float64
This is also expected since the column has nulls and numpy's int64 is not nullable. If you would like to use Pandas's nullable integer column you can do...
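def lookup(t):
    # Map every Arrow integer type to pandas' nullable Int64 dtype
    if pa.types.is_integer(t):
        return pd.Int64Dtype()
    # Returning None keeps the default conversion for all other types
    return None
df = table.to_pandas(types_mapper=lookup)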
Of course, you could create a more fine grained lookup if you wanted to use both Int32Dtype and Int64Dtype, this is just a template to get you started.
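The int-to-float behaviour can be reproduced without any Parquet file. The sketch below uses plain pandas, with values chosen to mirror the tport printout in the question. It shows that a column of integers with nulls falls back to float64 by default, while the nullable Int64 dtype keeps the integers and marks the gaps as missing.

```python
import pandas as pd

# Integers plus nulls: NumPy's int64 cannot hold a null, so pandas uses float64.
raw = pd.Series([0, None, None, 1], name="tport")
print(raw.dtype)

# The nullable extension dtype keeps integer values and stores nulls as <NA>.
nullable = raw.astype("Int64")
print(nullable.dtype)
print(nullable)
```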

Python3 - Return CSV with row-level errors for missing data

New to Python. I'm importing a CSV, then if any data is missing I need to return a CSV with an additional column to indicate which rows are missing data. Colleague suggested that I import CSV into a dataframe, then create a new dataframe with a "Comments" column, fill it with a comment on the intended rows, and append it to the original dataframe. I'm stuck at the step of filling my new dataframe, "dferr", with the correct number of rows that would match up to "dfinput".
Have Googled, "pandas csv return error column where data is missing", but haven't found anything related to creating a new CSV that marks bad rows. I don't even know if the proposed way is the best way to go about this.
import pandas as pd
dfinput = None
try:
    dfinput = pd.read_csv(r"C:\file.csv")
except:
    print("Uh oh!")
if dfinput is None:
    print("Ack!")
    quit(10)
dfinput.reset_index(level=None, drop=False, inplace=True, col_level=0,
                    col_fill='')
dferr = pd.DataFrame(columns=['comment'])
print("Empty DataFrame", dferr, sep='\n')
Expected results: "dferr" would have an index column with number of rows equal to "dfinput", and comments on the correct rows where "dfinput" has missing values.
Actual results: "dferr" is empty.
My understanding of 'missing data' here would be null values. It seems that for every row, you want the names of null fields.
df = pd.DataFrame([[1,2,3],
[4,None,6],
[None,8,None]],
columns=['foo','bar','baz'])
# Create a dataframe of True/False, True where a criterion is met
# (in this case, a null value)
nulls = df.isnull()
# Iterate through every row of *nulls*,
# and extract the column names where the value is True by boolean indexing
colnames = nulls.columns
null_labels = nulls.apply(lambda s:colnames[s], axis=1)
# Now you have a pd.Series where every entry is an array
# (technically, a pd.Index object)
# Pandas arrays have a vectorized .str.join method:
df['nullcols'] = null_labels.str.join(', ')
The .apply() method in pandas can sometimes be a bottleneck in your code; there are ways to avoid using this, but here it seemed to be the simplest solution I could think of.
EDIT: Here's an alternate one-liner (instead of using .apply) that might cut down computation time slightly:
import numpy as np
df['nullcols'] = [', '.join(colnames[x]) for x in nulls.values]
This might be even faster (a bit more work is required):
np.where(df.isnull(),df.columns,'')
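One possible way to finish the np.where idea, as a sketch rather than a benchmarked solution: np.where yields a 2-D array that holds the column name wherever a value is null and an empty string elsewhere, so the remaining work is joining the non-empty names in each row. The example reuses the small foo/bar/baz frame from above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3],
                   [4, None, 6],
                   [None, 8, None]],
                  columns=["foo", "bar", "baz"])

# Compute the mask before adding the new column, so it is not included itself.
mask = df.isnull().to_numpy()
names = df.columns.to_numpy()

# Column name where the value is null, empty string everywhere else.
labels = np.where(mask, names, "")

# Join the non-empty names of each row into one comment string.
df["nullcols"] = [", ".join(name for name in row if name) for row in labels]
print(df)
```

Rows without missing data end up with an empty comment, and df.to_csv can then write the result out with the extra column.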

Python Pandas dataframe, how to integrate new columns into a new csv

Guys, I need a bit of help with pandas and would greatly appreciate your input.
My original file looks like this:
I would like to convert it by merging some pairs of columns (taking their averages) and return a new file that looks like this:
Also, if possible, I would like to split the column 'RateDateTime' into two columns, one containing the date and the other only the time. How should I do it? I tried the code below, but it doesn't work:
import pandas as pd
dateparse = lambda x: pd.datetime.strptime(x, '%Y/%m/%d %H:%M:%S')
df = pd.read_csv('data.csv', parse_dates=['RateDateTime'], index_col='RateDateTime',date_parser=dateparse)
a=pd.to_numeric(df['RateAsk_open'])
b=pd.to_numeric(df['RateAsk_high'])
c=pd.to_numeric(df['RateAsk_low'])
d=pd.to_numeric(df['RateAsk_close'])
e=pd.to_numeric(df['RateBid_open'])
f=pd.to_numeric(df['RateBid_high'])
g=pd.to_numeric(df['RateBid_low'])
h=pd.to_numeric(df['RateBid_close'])
df['Open'] = (a+e) /2
df['High'] = (b+f) /2
df['Low'] = (c+g) /2
df['Close'] = (d+h) /2
grouped = df.groupby('CurrencyPair')
Open=grouped['Open']
High=grouped['High']
Low=grouped['Low']
Close=grouped['Close']
w=pd.concat([Open, High,Low,Close], axis=1, keys=['Open', 'High','Low','Close'])
w.to_csv('w.csv')
Python returns:
TypeError: cannot concatenate object of type "<class 'pandas.core.groupby.groupby.SeriesGroupBy'>"; only pd.Series, pd.DataFrame, and pd.Panel (deprecated) objs are valid
Can someone help me please? Many thanks!!!
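IIUYC, you don't need grouping here. You can simply add the new columns to the existing dataframe and use the columns argument of to_csv to specify which columns to save. The example assumes RateDateTime is a regular datetime column, i.e. the file was read without index_col='RateDateTime':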
df['Open'] = df[['RateAsk_open', 'RateBid_open']].mean(axis=1)
df['RateDate'] = df['RateDateTime'].dt.date
df['RateTime'] = df['RateDateTime'].dt.time
df.to_csv('w.csv', columns=['CurrencyPair', 'Open', 'RateDate', 'RateTime'])
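For completeness, here is a sketch that builds all four averaged columns and the date/time split in one go. The two sample rows are made up; only the column names and the date format come from the question.

```python
import pandas as pd

# Illustrative data; column names follow the question, values are invented.
df = pd.DataFrame({
    "CurrencyPair": ["EUR/USD", "EUR/USD"],
    "RateDateTime": ["2018/01/02 09:00:00", "2018/01/02 09:01:00"],
    "RateAsk_open": [2.0, 3.0], "RateBid_open": [1.0, 2.0],
    "RateAsk_high": [4.0, 5.0], "RateBid_high": [3.0, 4.0],
    "RateAsk_low": [1.0, 2.0], "RateBid_low": [0.0, 1.0],
    "RateAsk_close": [3.0, 4.0], "RateBid_close": [2.0, 3.0],
})

# Parse the timestamp as a regular column (not as the index).
df["RateDateTime"] = pd.to_datetime(df["RateDateTime"], format="%Y/%m/%d %H:%M:%S")

# Average each ask/bid pair row by row.
for name in ("open", "high", "low", "close"):
    df[name.capitalize()] = df[[f"RateAsk_{name}", f"RateBid_{name}"]].mean(axis=1)

# Split the timestamp into separate date and time columns.
df["RateDate"] = df["RateDateTime"].dt.date
df["RateTime"] = df["RateDateTime"].dt.time

out_columns = ["CurrencyPair", "RateDate", "RateTime", "Open", "High", "Low", "Close"]
print(df[out_columns])
```

Passing columns=out_columns and index=False to df.to_csv would then write only these columns.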

Python 3 appending to an empty DataFrame

Fairly new to coding. I have looked at a couple of other similar questions about appending DataFrames in python but could not solve the problem.
I have the data below (CSV) in an excel xls file:
Venue Name,Cost ,Restriction,Capacity
Cinema,5,over 13,50
Bar,10,over 18,50
Restaurant,15,no restriction,25
Hotel,7,no restriction,100
I am using the code below to try to "filter" out rows which have "no restriction" under the restrictions column. The code seems to work right through to the last line i.e. both print statements are giving me what I would expect.
import pandas as pd
import numpy as np
my_file = pd.ExcelFile("venue data.xlsx")
mydata = my_file.parse(0, index_col = None, na_values = ["NA"])
my_new_file = pd.DataFrame()
for index in mydata.index:
    if "no restriction" in mydata.Restriction[index]:
        print (mydata.Restriction[index])
        print (mydata.loc[index:index])
        my_new_file.append(mydata.loc[index:index], ignore_index = True)
Don't loop through dataframes. It's almost never necessary.
Use:
df2 = df[df['Restriction'] != 'no restriction']
Or
df2 = df.query("Restriction != 'no restriction'")
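As for why the loop in the question leaves my_new_file empty: DataFrame.append never modified the frame in place. It returned a new DataFrame, which the loop discarded. The method was also removed in pandas 2.0. The loop in the question collects the rows that do contain "no restriction"; if that is the goal, the sketch below shows both a pd.concat version of the loop and the vectorized equivalent. The frame is rebuilt from the sample data in the question rather than read from Excel.

```python
import pandas as pd

# Sample data from the question (note the trailing space in "Cost ").
mydata = pd.DataFrame({
    "Venue Name": ["Cinema", "Bar", "Restaurant", "Hotel"],
    "Cost ": [5, 10, 15, 7],
    "Restriction": ["over 13", "over 18", "no restriction", "no restriction"],
    "Capacity": [50, 50, 25, 100],
})

# Variant 1: keep the loop, but collect the pieces and concatenate once.
pieces = [mydata.loc[[index]] for index in mydata.index
          if "no restriction" in mydata.Restriction[index]]
looped = pd.concat(pieces, ignore_index=True)

# Variant 2: a boolean mask does the same without an explicit loop.
mask = mydata["Restriction"].str.contains("no restriction", regex=False)
my_new_file = mydata[mask].reset_index(drop=True)

print(my_new_file)
```

To drop those rows instead, as in the answer above, negate the mask with ~mask.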
