Return a dataframe to a dataframe - python-3.x

I am trying to take the list of "ID"s from my dataframe dfrep, pass that column into a function I created, use the values in a query, and return the results back into dfrep.
My function returns a dataframe, but the result includes the header, and when I print dfrep there are two lines. I also cannot write the dataframe to Excel using xlwings because I get TypeError: must be a pywintypes time object (got DataFrame).
def overrides(id):
    sql = f"select name from sales..rep where id in({id})"
    mydf = pd.read_sql(sql, conn)
    return mydf

overrides = np.vectorize(overrides)
dfrep['name'] = overrides(dfrep['ID'])
wsData.range('A1').options(pd.DataFrame, index=False).value = dfrep
My goal is to load the column(s) from my function's dataframe into my main dataframe dfrep and then write it to Excel via xlwings. Any help is appreciated.
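One way to avoid running the query once per row (a sketch, assuming dfrep, conn and wsData exist as in the code above and that sales..rep has id and name columns) is to fetch all the names in a single query and merge the result back into dfrep before writing it out:
# build one IN (...) list from the unique IDs instead of vectorizing per row
ids = ", ".join(str(i) for i in dfrep['ID'].unique())
names = pd.read_sql(f"select id, name from sales..rep where id in ({ids})", conn)
# merge the looked-up names into the main dataframe
dfrep = dfrep.merge(names, left_on='ID', right_on='id', how='left').drop(columns='id')
# dfrep now holds plain scalar values, so xlwings can write it directly
wsData.range('A1').options(index=False).value = dfrep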

Related

DataFrames getting distorted while unpacking list of dataframes in Python Streamlit, with each dataframe looking different from the other

I am working on Streamlit in python.
I have created a function which allows the user to upload multiple files, reads those files one by one, and stores them in a dataframe. I want this function to return all those dataframes to a separate .py file, so I appended all the dataframes to a list, i.e. a list of dataframes, and here my problem started!
The list named 'all_df_list' has all the dataframes in it.
But when I unpack this list in the other Python script and view the dataframes there, the result is distorted, i.e. the dataframes no longer hold the shape and look they had before being appended to the list.
Here is the code snippet:
import streamlit as st
import pandas as pd
def user_selection(action):
    operation_mode = action
    mydf = pd.DataFrame()
    with st.sidebar:
        if operation_mode == 'Multiple_files':
            uploaded_files = st.file_uploader("Choose required files", accept_multiple_files = True)
            all_df_list = []
            for uploaded_file in uploaded_files:
                mydf = mydf.iloc[0:0]
                if uploaded_file.name is not None:
                    # For now, assuming the user will only upload multiple csv files
                    mydf = pd.read_csv(uploaded_file)
                    all_df_list.append(mydf)
            st.write(all_df_list)
            return all_df_list
Now, let's say I called this function in another script and unpacked the list 'all_df_list' as:
*all_dfs = user_selection('Multiple_files')
st.write(all_dfs[0]) # the dataframe all_dfs[0] has lost its shape and original orientation.
How can I successfully return multiple dataframes from a function, such that when I unpack them I see the dataframes as they were before appending them to the list?
Any lead on this will help. Thank you.
As I wrote in the comment section, the problem seems to be irrelevant lines of code and awkward code construction rather than an undesired format of the dataframes returned by user_selection.
If you are only looking to return a list of dfs:
def user_selection(action):
    operation_mode = action
    mydf = pd.DataFrame()
    all_df_list = []
    with st.sidebar:
        if operation_mode == 'Multiple_files':
            uploaded_files = st.file_uploader("Choose required files", accept_multiple_files=True)
            if uploaded_files:
                for uploaded_file in uploaded_files:
                    mydf = pd.read_csv(uploaded_file)
                    all_df_list.append(mydf)
                st.write(all_df_list)
    return all_df_list
Call this function in another script:
all_dfs = user_selection('Multiple_files')
if all_dfs:
    st.write(all_dfs[0])
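To confirm that each dataframe kept its original shape after being returned, you could also loop over the whole list instead of indexing a single element (a small sketch using the same function):
all_dfs = user_selection('Multiple_files')
for i, df in enumerate(all_dfs):
    # show the shape and the full dataframe for each uploaded file
    st.write(f"DataFrame {i}: {df.shape[0]} rows x {df.shape[1]} columns")
    st.dataframe(df)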

Any optimized way to iterate Excel and provide data into pd.read_sql() as a string one by one

#here I have to apply the loop which can provide me the queries from excel for respective reports:
df1 = pd.read_sql(SQLqueryB2, con=con1)
df2 = pd.read_sql(ORCqueryC2, con=con2)
if (df1.equals(df2)):
print(Report2 +" : is Pass")
Can we achieve the above by doing something like this (by iterating the ndarray)?
df = pd.read_excel(path)
for col, item in df.iteritems():
Or is the only option left to read the Excel file with the "openpyxl" library, iterate over rows and columns, and then provide the values? I hope the question is clear; if there is any doubt, please comment.
You are trying to loop through an Excel file, run the two queries, see if they match, and output the result, correct?
import pandas as pd
from sqlalchemy import create_engine
# add user, pass, database name
con = create_engine(f"mysql+pymysql://{USER}:{PWD}@{HOST}/{DB}")
file = pd.read_excel('excel_file.xlsx')
file['Result'] = '' # placeholder
for i, row in file.iterrows():
    df1 = pd.read_sql(row['SQLQuery'], con)
    df2 = pd.read_sql(row['Oracle Queries'], con)
    file.loc[i, 'Result'] = 'Pass' if df1.equals(df2) else 'Fail'
file.to_excel('results.xlsx', index=False)
This will save a file named results.xlsx that mirrors the original data but adds a column named Result that will be Pass or Fail.
Example results.xlsx:
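If, as in the original snippet, the two queries run against different databases (con1 for SQL Server and con2 for Oracle), the same loop works with two engines. A rough sketch, with placeholder connection strings that you would need to adapt to your drivers and credentials:
from sqlalchemy import create_engine
# placeholder URLs - replace with your actual hosts, users and passwords
con1 = create_engine("mssql+pyodbc://user:pwd@my_sqlserver_dsn")
con2 = create_engine("oracle+cx_oracle://user:pwd@oracle_host:1521/?service_name=orcl")
for i, row in file.iterrows():
    df1 = pd.read_sql(row['SQLQuery'], con1)        # SQL Server query from the sheet
    df2 = pd.read_sql(row['Oracle Queries'], con2)  # Oracle query from the sheet
    file.loc[i, 'Result'] = 'Pass' if df1.equals(df2) else 'Fail'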

Get the Dataframe from temp view name

I can register a Dataframe in the catalog using
df.createOrReplaceTempView("my_name")
I can check if the dataframe is registered
next(filter(lambda table:table.name=='my_name',spark.catalog.listTables()))
But how can I get the DataFrame associated with the name?
a_df = spark.catalog.getTempView()
assert id(a_df)==id(df)
The spark.table() method will return the dataframe for the given table/view name:
a_df = spark.table('my_name')
You may also use:
df=spark.sql("select * from my_name")
if you want to do some SQL before loading your df.
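Putting the two suggestions together (a small sketch reusing the 'my_name' view from the question): note that spark.table returns a new DataFrame object that reads from the registered view, so an identity check like id(a_df) == id(df) will not pass even though the data is the same.
df.createOrReplaceTempView("my_name")
a_df = spark.table("my_name")              # DataFrame backed by the temp view
b_df = spark.sql("select * from my_name")  # equivalent, via SQL
# compare contents rather than object identity
assert a_df.count() == df.count()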

pandas data types changed when reading from parquet file?

I am brand new to pandas and the parquet file type. I have a python script that:
reads in a hdfs parquet file
converts it to a pandas dataframe
loops through specific columns and changes some values
writes the dataframe back to a parquet file
Then the parquet file is imported back into hdfs using impala-shell.
The issue I'm having appears to be with step 2. I have it print out the contents of the dataframe immediately after it reads it in and before any changes are made in step 3. It appears to be changing the datatypes and the data of some fields, which causes problems when it writes it back to a parquet file. Examples:
fields that show up as NULL in the database are replaced with the string "None" (for string columns) or the string "nan" (for numeric columns) in the printout of the dataframe.
fields that should be an Int with a value of 0 in the database are changed to "0.00000" and turned into a float in the dataframe.
It appears that it is actually changing these values, because when it writes the parquet file and I import it into hdfs and run a query, I get errors like this:
WARNINGS: File '<path>/test.parquet' has an incompatible Parquet schema for column
'<database>.<table>.tport'. Column type: INT, Parquet schema:
optional double tport [i:1 d:1 r:0]
I don't know why it would alter the data and not just leave it as-is. If this is what's happening, I don't know if I need to loop over every column and replace all these back to their original values, or if there is some other way to tell it to leave them alone.
I have been using this reference page:
http://arrow.apache.org/docs/python/parquet.html
It uses
pq.read_table(in_file)
to read the parquet file and then
df = table2.to_pandas()
to convert to a dataframe that I can loop through and change the columns. I don't understand why it's changing the data, and I can't find a way to prevent this from happening. Is there a different way I need to read it than read_table?
If I query the database, the data would look like this:
tport
0
1
My print(df) line for the same thing looks like this:
tport
0.00000
nan
nan
1.00000
Here is the relevant code. I left out the part that processes the command-line arguments since it was long and it doesn't apply to this problem. The file passed in is in_file:
import sys, getopt
import random
import re
import math
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow as pa
import os.path
# <CLI PROCESSING SECTION HERE>
# GET LIST OF COLUMNS THAT MUST BE SCRAMBLED
field_file = open('scrambler_columns.txt', 'r')
contents = field_file.read()
scrambler_columns = contents.split('\n')
def scramble_str(xstr):
    #print(xstr + '_scrambled!')
    return xstr + '_scrambled!'
parquet_file = pq.ParquetFile(in_file)
table2 = pq.read_table(in_file)
metadata = pq.read_metadata(in_file)
df = table2.to_pandas() #dataframe
print('rows: ' + str(df.shape[0]))
print('cols: ' + str(df.shape[1]))
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', -1)
pd.set_option('display.float_format', lambda x: '%.5f' % x)
#df.fillna(value='', inplace=True) # np.nan # \xa0
print(df) # print before making any changes
cols = list(df)
# https://pythonbasics.org/pandas-iterate-dataframe/
for col_name, col_data in df.iteritems():
    #print(cols[index])
    if col_name in scrambler_columns:
        print('scrambling values in column ' + col_name)
        for i, val in col_data.items():
            df.at[i, col_name] = scramble_str(str(val))
print(df) # print after making changes
print(parquet_file.num_row_groups)
print(parquet_file.read_row_group(0))
# WRITE NEW PARQUET FILE
new_table = pa.Table.from_pandas(df)
writer = pq.ParquetWriter(out_file, new_table.schema)
for i in range(1):
    writer.write_table(new_table)
writer.close()
if os.path.isfile(out_file) == True:
    print('wrote ' + out_file)
else:
    print('error writing file ' + out_file)
# READ NEW PARQUET FILE
table3 = pq.read_table(out_file)
df = table3.to_pandas() #dataframe
print(df)
EDIT
Here are the datatypes for the 1st few columns in hdfs
and here are the same ones that are in the pandas dataframe:
id object
col1 float64
col2 object
col3 object
col4 float64
col5 object
col6 object
col7 object
It appears to convert
String to object
Int to float64
bigint to float64
How can I tell pandas what data types the columns should be?
Edit 2: I was able to find a workaround by directly processing the pyarrow tables. Please see my question and answers here: How to update data in pyarrow table?
fields that show up as NULL in the database are replaced with the string "None" (for string columns) or the string "nan" (for numeric columns) in the printout of the dataframe.
This is expected. It's just how the pandas print function displays missing values.
It appears to convert String to object
This is also expected. Numpy/pandas does not have a dtype for variable length strings. It's possible to use a fixed-length string type but that would be pretty unusual.
It appears to convert Int to float64
This is also expected since the column has nulls and numpy's int64 is not nullable. If you would like to use Pandas's nullable integer column you can do...
def lookup(t):
    if pa.types.is_integer(t):
        return pd.Int64Dtype()

df = table.to_pandas(types_mapper=lookup)
Of course, you could create a more fine-grained lookup if you wanted to use both Int32Dtype and Int64Dtype; this is just a template to get you started.
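For example, a more fine-grained mapper (a sketch, not part of the original answer; mapping strings to pd.StringDtype is an extra assumption) could send each Arrow integer width to the matching pandas nullable dtype:
def lookup(t):
    # returning None for a type falls back to the default conversion
    if pa.types.is_int32(t):
        return pd.Int32Dtype()
    if pa.types.is_int64(t):
        return pd.Int64Dtype()
    if pa.types.is_string(t):
        return pd.StringDtype()
    return None

df = table.to_pandas(types_mapper=lookup)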

How to handle min_itemsize exception in writing to pandas HDFStore

I am using pandas HDFStore to store dfs which I have created from data.
store = pd.HDFStore(storeName, ...)
for file in downloaded_files:
    try:
        with gzip.open(file) as f:
            data = json.loads(f.read())
            df = json_normalize(data)
            store.append(storekey, df, format='table', append=True)
    except TypeError:
        pass
        #File Error
I have received the error:
ValueError: Trying to store a string with len [82] in [values_block_2] column but
this column has a limit of [72]!
Consider using min_itemsize to preset the sizes on these columns
I found that it is possible to set min_itemsize for the column involved, but this is not a viable solution, as I do not know the maximum length I will encounter, nor all the columns in which I will encounter the problem.
Is there a solution to automatically catch this exception and handle it each time it occurs?
I think you can do it this way:
store.append(storekey, df, format='table', append=True, min_itemsize={'Long_string_column': 200})
basically it's very similar to the following create table SQL statement:
create table df(
    id int,
    str varchar(200)
);
where 200 is the maximal allowed length for the str column
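Since the question mentions not knowing the maximum lengths up front, one workaround (a sketch, not from the original answer; string_sizes is a hypothetical helper) is to measure the string columns of each dataframe before appending and pass the result as min_itemsize, with some headroom for later files:
def string_sizes(df, headroom=2.0):
    # longest string in each object column, padded by a headroom factor
    sizes = {}
    for col in df.select_dtypes(include='object').columns:
        max_len = df[col].astype(str).str.len().max()
        sizes[col] = int(max_len * headroom) if pd.notna(max_len) else 10
    return sizes

store.append(storekey, df, format='table', append=True,
             min_itemsize=string_sizes(df))
Note that the column sizes are fixed when the table is first created, so the headroom only helps if the first file is reasonably representative; otherwise choosing a generous fixed value per column is the safer option.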
The following links might be very helpful:
https://www.google.com/search?q=pandas+ValueError%3A+Trying+to+store+a+string+with+len+in+column+but+min_itemsize&pws=0&gl=us&gws_rd=cr
HDFStore.append(string, DataFrame) fails when string column contents are longer than those already there
Pandas pytable: how to specify min_itemsize of the elements of a MultiIndex
