Using Pandas, I'm attempting to 'slice' (sorry if that's not the correct term) segments out of one DF and into a new one, where the segments are stacked one on top of the other.
Code:
import pandas as pd
df = pd.DataFrame(
    {
        'TYPE': ['System','VERIFY','CMD','SECTION','SECTION','VERIFY','CMD','CMD','VERIFY','CMD','System'],
        'DATE': [100,200,300,400,500,600,700,800,900,1000,1100],
        'OTHER': [10,20,30,40,50,60,70,80,90,100,110],
        'STEP': ['Power On','Start: 2','Start: 1-1','Start: 10-7','End: 10-7','Start: 3-1','Start: 10-8','End: 1-1','End: 3-1','End: 10-8','Power Off']
    })
print(df)
column_headers = df.columns.values.tolist()
col_name_type = 'TYPE'
col_name_other = 'OTHER'
col_name_step = 'STEP'
segments = []
df_blank = pd.DataFrame({'TYPE': ['BLANK ROW']}, columns = column_headers)
types_to_check = ['CMD', 'VERIFY']
type_df = df[df[col_name_type].isin(types_to_check)]
for row in type_df:
    if 'CMD' in row:
        if 'START:' in row[col_name_step].value:
            idx_start = row.iloc[::-1].str.match('VERIFY').first_valid_index() #go backwards and find first VERIFY
            step_match = row[col_name_step].value[6:] #get the unique ID after Start:
            idx_end = df[df[col_name_step].str.endswith(step_match, na=False)].last_valid_index() #find last instance of matching unique id
            segments.append(df.loc[idx_start:idx_end, :])
            segments.append(df_blank)
df_segments = pd.concat(segments)
print(df)
print(df_segments)
Nothing gets populated in my segments list, so the concat call fails.
From my research I'm confident that this can be done using either .loc or .iloc, but I can't seem to get a working implementation.
My DF:
What I am trying to make:
Any help and/or guidance would be welcome.
Edit: To clarify, I'm trying to create a new DF made up of every group of rows where the start is the "VERIFY" row that comes before a "CMD" row containing "Start:", and the end is the matching "CMD" row containing "End:".
EDIT2: I think the following is something close to what I need, but I'm unsure how to get it to reliably work:
segments = []
df_blank = pd.DataFrame({'TYPE': ['BLANK ROW']}, columns = column_headers)
types_to_check = ['CMD', 'VERIFY']
cmd_check = ['CMD']
verify_check = ['VERIFY']
cmd_df = df[(df[col_name_type].isin(cmd_check))]
cmd_start_df = cmd_df[(cmd_df[col_name_step].str.contains('START:'))]
for cmd_idx in cmd_start_df.index:
    step_name = df.loc[cmd_idx, col_name_step][6:]
    temp_df = df.loc[:cmd_idx,]
    idx_start = temp_df[col_name_type].isin(verify_check).last_valid_index()
    idx_end = cmd_df[cmd_df[col_name_type].str.endswith(step_name, na=False)].last_valid_index()
    segments.append(df.loc[idx_start:idx_end, :])
    segments.append(df_blank)
df_segments = pd.concat(segments)
You can use str.contains:
segmented_df = df.loc[df['STEP'].str.contains('Start|End')]
print(segmented_df)
I created some code to accomplish the 'slicing' I wanted:
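slides = []  # collected slices; cmd_df, cmd_start_df and df_blank are defined as in EDIT2 of the question
for cmd_idx in cmd_start_df.index:
    step_name = df.loc[cmd_idx, col_name_step][6:]  # unique ID after 'Start:'
    temp_df = df.loc[:cmd_idx,:]  # everything up to and including this CMD row
    temp_list = temp_df[col_name_type].values.tolist()
    if 'VERIFY' in temp_list:
        # last VERIFY row at or before the CMD row
        idx_start = temp_df[temp_df[col_name_type].str.match('VERIFY')].last_valid_index()
    else:
        idx_start = cmd_idx
    # last CMD row whose STEP ends with the same unique ID
    idx_end = cmd_df[cmd_df[col_name_step].str.endswith(step_name, na=False)].last_valid_index()
    slides.append(df.loc[idx_start:idx_end, :])
    slides.append(df_blank)
df_segments = pd.concat(slides)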
For each START index, I essentially create a new DF that is a subset of the old DF up to that index, then find the last_valid_index that has VERIFY. I use that index to create a filtered DF from idx_start to idx_end, and eventually concat all those slices into one DF.
Maybe there's an easier way, but I couldn't find it.
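For what it's worth, the same idea can be written as one self-contained function. This is only a sketch under a few assumptions: the index is the default sorted RangeIndex, a segment ends at the CMD row whose STEP is exactly 'End: <id>' (the code above uses str.endswith instead), and collect_segments is a name made up for this example. Start rows without a matching End row are skipped.

```python
import pandas as pd

df = pd.DataFrame({
    'TYPE': ['System', 'VERIFY', 'CMD', 'SECTION', 'SECTION', 'VERIFY',
             'CMD', 'CMD', 'VERIFY', 'CMD', 'System'],
    'DATE': [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100],
    'OTHER': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110],
    'STEP': ['Power On', 'Start: 2', 'Start: 1-1', 'Start: 10-7', 'End: 10-7',
             'Start: 3-1', 'Start: 10-8', 'End: 1-1', 'End: 3-1', 'End: 10-8',
             'Power Off'],
})


def collect_segments(df, type_col='TYPE', step_col='STEP'):
    """Hypothetical helper: stack VERIFY..End slices, separated by a blank row."""
    is_cmd = df[type_col] == 'CMD'
    is_verify = df[type_col] == 'VERIFY'
    # CMD rows that open a step
    starts = df.index[is_cmd & df[step_col].str.startswith('Start:')]
    blank = pd.DataFrame({type_col: ['BLANK ROW']}, columns=df.columns)

    pieces = []
    for cmd_idx in starts:
        step_id = df.loc[cmd_idx, step_col][len('Start:'):].strip()
        # last VERIFY at or before the CMD row; fall back to the CMD row itself
        verify_before = df.index[is_verify & (df.index <= cmd_idx)]
        idx_start = verify_before[-1] if len(verify_before) else cmd_idx
        # CMD row that closes the same step (assumed exact 'End: <id>' match)
        ends = df.index[is_cmd & (df[step_col] == f'End: {step_id}')]
        if len(ends) == 0:
            continue  # no matching End row, skip this Start
        pieces.extend([df.loc[idx_start:ends[-1]], blank])
    return pd.concat(pieces) if pieces else df.iloc[0:0]


df_segments = collect_segments(df)
print(df_segments)
```

With the sample data this gives two segments, rows 1 to 7 and rows 5 to 9, each followed by a blank row.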
# Here I need a loop that supplies the queries from Excel for the respective reports:
df1 = pd.read_sql(SQLqueryB2, con=con1)
df2 = pd.read_sql(ORCqueryC2, con=con2)
if df1.equals(df2):
    print(Report2 + " : is Pass")
Can we achieve the above by doing something like this (iterating over an ndarray)?
df = pd.read_excel(path)
for col, item in df.iteritems():
Or is the only option left to read the Excel file with the openpyxl library, iterate over the rows and columns, and then pass in the values? I hope the question is clear; if anything is unclear, please leave a comment.
You are trying to loop through an excel file, run the 2 queries, see if they match and output the result, correct?
import pandas as pd
from sqlalchemy import create_engine
# fill in USER, PWD, HOST and DB (user, password, host, database name)
con = create_engine(f"mysql+pymysql://{USER}:{PWD}@{HOST}/{DB}")
file = pd.read_excel('excel_file.xlsx')
file['Result'] = '' # placeholder
for i, row in file.iterrows():
    # run both queries for this row; use a second engine for df2 if the
    # Oracle queries run against a different database
    df1 = pd.read_sql(row['SQLQuery'], con)
    df2 = pd.read_sql(row['Oracle Queries'], con)
    file.loc[i, 'Result'] = 'Pass' if df1.equals(df2) else 'Fail'
file.to_excel('results.xlsx', index=False)
This will save a file named results.xlsx that mirrors the original data but adds a column named Result that will be Pass or Fail.
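If you want to try the loop without a database server or an Excel file, here is a rough, self-contained sketch of the same pattern. Everything in it is a stand-in: two in-memory SQLite databases play the role of con1 and con2 from the question, the queries DataFrame replaces pd.read_excel, and the table t and its contents are made up for illustration.

```python
import sqlite3

import pandas as pd

# Assumption: in-memory SQLite databases stand in for the two real connections.
con1 = sqlite3.connect(':memory:')
con2 = sqlite3.connect(':memory:')
for con, rows in ((con1, [(1, 'a'), (2, 'b')]), (con2, [(1, 'a'), (2, 'x')])):
    con.execute('CREATE TABLE t (id INTEGER, val TEXT)')
    con.executemany('INSERT INTO t VALUES (?, ?)', rows)

# Assumption: this frame stands in for pd.read_excel(path), one row per report.
queries = pd.DataFrame({
    'Report': ['Report1', 'Report2'],
    'SQLQuery': ['SELECT id FROM t ORDER BY id',
                 'SELECT id, val FROM t ORDER BY id'],
    'Oracle Queries': ['SELECT id FROM t ORDER BY id',
                       'SELECT id, val FROM t ORDER BY id'],
})


def compare(row):
    """Run both queries of one row and report whether the results are identical."""
    df1 = pd.read_sql(row['SQLQuery'], con1)
    df2 = pd.read_sql(row['Oracle Queries'], con2)
    return 'Pass' if df1.equals(df2) else 'Fail'


queries['Result'] = queries.apply(compare, axis=1)
print(queries[['Report', 'Result']])
```

Keep in mind that DataFrame.equals also compares dtypes and row order, so two databases returning the same values in a different order or with different column types will show up as Fail.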
I have the following DataFrame:
df = pd.DataFrame()
df['I'] = [-1.922410e-11, -6.415227e-12, 1.347632e-11, 1.728460e-11,3.787953e-11]
df['V'] = [0,0,0,1,1]
off = df.groupby('V')['I'].mean()
I need to subtract the off values from the respective df['I'] values. In code I want something like this:
for i in df['V'].unique():
    df['I'][df['V']==i] -= off.loc[i]
I want to know if there is another way of doing this without using loops.
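One loop-free option is groupby(...).transform('mean'), which returns the group mean aligned to every original row, so it can be subtracted directly. A minimal sketch with the data from the question; the column name I_centered is only chosen for this example, and mapping off onto df['V'] is shown as an equivalent alternative:

```python
import pandas as pd

df = pd.DataFrame({
    'I': [-1.922410e-11, -6.415227e-12, 1.347632e-11, 1.728460e-11, 3.787953e-11],
    'V': [0, 0, 0, 1, 1],
})

# transform('mean') broadcasts each group's mean back onto the rows of that group
df['I_centered'] = df['I'] - df.groupby('V')['I'].transform('mean')

# Equivalent alternative: look up each row's offset in the per-group means
off = df.groupby('V')['I'].mean()
alt = df['I'] - df['V'].map(off)

print(df)
```

Writing the result to a new column (or assigning back to df['I']) also avoids the chained assignment df['I'][mask] -= ..., which pandas may warn about.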
I have a dataframe with a column that contains the names and values of two different columns, as follows:
How do I transform it into separate columns?
So far, I have tried the following:
Using df[col].apply(pd.Series). It didn't work, since the data in the column is not in dictionary format.
Splitting the column on the semicolon (";"). That is not a good idea, since the given dataframe might have any number of columns depending on the response.
EDIT:
Data in plain text format:
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
How about:
df2 = (df["ClusterName"]
       .str.replace("Date:", "")
       .str.replace("Bucket:", "")
       .str.split(";", expand=True))
df2.columns = ["Date", "Bucket"]
EDIT:
Without hardcoding the variable names, here's a quick hack. You can clean it up (and make less silly variable names):
df_temp = df.ClusterName.str.split(";", expand=True)
cols = []
for col in df_temp:
    df_temptemp = df_temp[col].str.split(":", expand=True)
    df_temp[col] = df_temptemp[1]
    cols.append(df_temptemp.iloc[0, 0])
df_temp.columns = cols
So, maybe something like this.
Set up the data frame:
d = {'ClusterName': ['Date:20191010;Bucket:All','Date:20191010;Bucket:some','Date:20191010;Bucket:All']}
df = pd.DataFrame(data=d)
df
Loop over the dataframe, splitting each value on the semicolon and then on the colon:
ls = []
for index, row in df.iterrows():
    splits = row['ClusterName'].split(';')
    print(splits[0].split(':')[1],splits[1].split(':')[1])
    ls.append([splits[0].split(':')[1],splits[1].split(':')[1]])
df = pd.DataFrame(ls, columns =['Date', 'Bucket'])
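Since the question mentions that the number of key:value pairs may vary, another possible approach is to turn each string into a dict and let the DataFrame constructor line up the columns. This is only a sketch, assuming every pair has the form name:value and pairs are separated by ";"; parse_pairs is a helper name invented here:

```python
import pandas as pd

df = pd.DataFrame({'ClusterName': ['Date:20191010;Bucket:All',
                                   'Date:20191010;Bucket:some',
                                   'Date:20191010;Bucket:All']})


def parse_pairs(text):
    """Hypothetical helper: 'A:1;B:2' -> {'A': '1', 'B': '2'}."""
    # split(':', 1) splits on the first colon only, so values may contain colons
    return dict(pair.split(':', 1) for pair in text.split(';'))


# One dict per row; keys become column names, missing keys become NaN
parsed = pd.DataFrame(df['ClusterName'].map(parse_pairs).tolist(), index=df.index)
print(parsed)
```

No column names are hardcoded, and all values stay strings, so convert Date afterwards if you need a real date type.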
User-defined function: my_fun(x) returns a list
XYZ = file with LOTS of lines
pandas_frame = pd.DataFrame() # create an empty data frame
for index in range(0,len(XYZ)):
    pandas_frame = pandas_frame.append(pd.DataFrame(my_fun(XYZ[index])).transpose(), ignore_index=True)
This code takes a very long time to run, on the order of days. How do I speed it up?
I think you need to apply the function to each line in a list comprehension to build a new list, and then call the DataFrame constructor only once:
L = [my_fun(XYZ[i]) for i in range(len(XYZ))]
df = pd.DataFrame(L)
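The reason this helps is that every append call copies the whole frame built so far, so the loop does quadratic work, while a Python list grows cheaply. DataFrame.append was also removed in pandas 2.0. Here is a small runnable sketch of the pattern; my_fun and XYZ below are toy stand-ins, since the real function and file are not shown in the question:

```python
import pandas as pd


def my_fun(line):
    """Toy stand-in for the user-defined function: returns a list per line."""
    return line.split(',')


# Toy stand-in for the lines of the large file
XYZ = ['1,2,3', '4,5,6', '7,8,9']

# Collect plain lists first, then build the DataFrame once at the end
rows = [my_fun(line) for line in XYZ]
pandas_frame = pd.DataFrame(rows)
print(pandas_frame)
```

Each list becomes one row, which matches what the transpose() in the original loop was doing.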