I have a dataframe with several columns, some of which contain NaN values. For each row, I would like to create another column containing the total number of columns minus the number of NaN values before the first non-NaN value.
Original dataframe:
ID Value0 Value1 Value2 Value3
1 10 10 8 15
2 NaN 45 52 NaN
3 NaN NaN NaN NaN
4 NaN NaN 100 150
The extra column would look like:
ID NewColumn
1 4
2 3
3 0
4 2
Thanks in advance!
Set the index to ID
Attach a non-null column to stop/catch the argmax
Use argmax to find the first non-null value
Subtract those values from the length of the relevant columns
df.assign(
    NewColumn=
        df.shape[1] - 1 -
        df.set_index('ID').assign(notnull=1).notnull().values.argmax(1)
)
ID Value0 Value1 Value2 Value3 NewColumn
0 1 10.0 10.0 8.0 15.0 4
1 2 NaN 45.0 52.0 NaN 3
2 3 NaN NaN NaN NaN 0
3 4 NaN NaN 100.0 150.0 2
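For reference, a minimal alternative sketch without the extra sentinel column (assuming the same df with an ID column) could look like this; note that argmax on an all-False row returns 0, so the all-NaN rows need an explicit fix-up:
vals = df.set_index('ID')
# width of the value columns minus the position of the first non-NaN value in each row
df['NewColumn'] = vals.shape[1] - vals.notna().values.argmax(axis=1)
# rows that are entirely NaN would otherwise get the full width, so zero them out
df.loc[vals.isna().all(axis=1).values, 'NewColumn'] = 0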
Related
I have an Excel file with multiple sheets in the format below. I need to create a single dataframe by concatenating all the sheets, unmerging the cells and then transposing them into a column based on the sheet.
Sheet 1:
Sheet 2:
Result expected - the final dataframe should be in the format below, with an extra column as shown below.
Code So far:
Reading File:
df = pd.concat(pd.read_excel('/Users/john/Desktop/Internal/Raw Files/Med/Dig/File_2017_2022.xlsx', sheet_name=None, skiprows=1))
Creating column:
df_1 = pd.concat([df.assign(name=n) for n,df in dfs.items()])
Use read_excel with header=[0,1] to create a MultiIndex from the first two header rows and index_col=[0,1] to create a MultiIndex from the first two columns. Then, in a loop, reshape each sheet with DataFrame.stack, add the new column, join everything with concat, and finally set the index names with DataFrame.rename_axis and convert them to columns with DataFrame.reset_index:
dfs = pd.read_excel('Input_V1.xlsx', sheet_name=None, header=[0,1], index_col=[0,1])
df_1 = (pd.concat([df.stack(0).assign(name=n) for n,df in dfs.items()])
.rename_axis(index=['Date','WK','Brand'], columns=None)
.reset_index())
df_1.insert(len(df_1.columns) - 2, 'Campaign', df_1.pop('Campaign'))
print (df_1)
Date WK Brand A B C D E F G \
0 2017-10-02 Week 40 ABC NaN NaN NaN NaN 56892.800000 83431.664000 NaN
1 2017-10-09 Week 41 ABC NaN NaN NaN NaN 0.713716 0.474025 NaN
2 2017-10-16 Week 42 ABC NaN NaN NaN NaN 0.025936 0.072500 NaN
3 2017-10-23 Week 43 ABC NaN NaN NaN NaN 0.182677 0.926731 NaN
4 2017-10-30 Week 44 ABC NaN NaN NaN NaN 0.755607 0.686115 NaN
.. ... ... ... .. .. .. .. ... ... ..
99 2018-03-26 Week 13 PQR NaN NaN NaN NaN 47702.000000 12246.000000 NaN
100 2018-04-02 Week 14 PQR NaN NaN NaN NaN 38768.000000 46498.000000 NaN
101 2018-04-09 Week 15 PQR NaN NaN NaN NaN 35917.000000 45329.000000 NaN
102 2018-04-16 Week 16 PQR NaN NaN NaN NaN 39639.000000 51343.000000 NaN
103 2018-04-23 Week 17 PQR NaN NaN NaN NaN 50867.000000 30119.000000 NaN
H I J K L Campaign name
0 NaN NaN NaN 0.017888 0.697324 NaN ABC
1 NaN NaN NaN 0.457963 0.810985 NaN ABC
2 NaN NaN NaN 0.743030 0.253668 NaN ABC
3 NaN NaN NaN 0.038683 0.050028 NaN ABC
4 NaN NaN NaN 0.885567 0.712333 NaN ABC
.. .. .. .. ... ... ... ...
99 NaN NaN NaN 9433.000000 17108.000000 WX PQR
100 NaN NaN NaN 12529.000000 23557.000000 WX PQR
101 NaN NaN NaN 20395.000000 44228.000000 WX PQR
102 NaN NaN NaN 55077.000000 45149.000000 WX PQR
103 NaN NaN NaN 45815.000000 35761.000000 WX PQR
[104 rows x 17 columns]
I created my own version of your Excel file, which looks like this:
The code below is far from perfect, but it should do fine as long as you do not have millions of sheets.
# First, obtain all sheet names
full_df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx',
                        sheet_name=None, skiprows=0)
# Store them into a list
sheet_names = list(full_df.keys())
# Create an empty dataframe to store the contents from each sheet
final_df = pd.DataFrame()
for sheet in sheet_names:
    df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx', sheet_name=sheet, skiprows=0)
    # Get the brand name
    brand = df.columns[1]
    # Remove the header columns and keep the numerical values only
    df.columns = df.iloc[0]
    df = df[1:]
    df = df.iloc[:, 1:]
    # Set the brand name into a new column
    df['Brand'] = brand
    # Append into the final dataframe
    final_df = pd.concat([final_df, df])
Your final_df should look like this once exported back to Excel.
EDIT: You might need to drop the dataframe's index when saving it, using df.reset_index(drop=True), to remove the first column shown in the image right above.
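A minimal sketch of that export step (the output path here is just a placeholder):
# drop the index, then write without it so no extra first column appears in Excel
final_df = final_df.reset_index(drop=True)
final_df.to_excel(r'C:\Users\.\Downloads\combined.xlsx', index=False)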
I want to add multiple empty rows at the start of my dataframe. I have tried using a list, but it doesn't seem to give the desired result:
Example df:
Col1  col2  col3   col4
One   Two   Three  four
2     4     5      8
Desired df (the same table, with n empty rows above the column names):
(n empty rows)
Col1  col2  col3   col4
One   Two   Three  four
2     4     5      8
The column names should start from the nth row; I want to add n empty rows at the beginning of my dataframe.
I'm not sure why you would want to do this, but I did it by splitting the original dataframe into a dataframe holding a single row of the column names and a separate dataframe of the data. I then created a dataframe of NaNs to be the blank rows and joined the three together. You will need to import numpy for this.
To simplify the code, I created a variable no_cols for the number of columns in the dataframe and no_empty_rows for how many empty rows to add:
no_cols = len(df.columns)
no_empty_rows = 6
Then I turned the column names into their own dataframe, with one row containing the column names and np.nan as the headers:
cols = pd.DataFrame([df.columns], columns = [np.nan]*no_cols)
NaN NaN NaN NaN
0 Col1 col2 col3 col4
Next I renamed the columns in the original dataframe to nan:
df.columns = [np.nan]*no_cols
NaN NaN NaN NaN
0 One Two Three four
1 2 4 5 8
Then I created a new dataframe of nans, with 6 blank rows (this can be changed):
df_empty_rows = (pd.DataFrame(data=[[np.nan]*no_cols]*no_empty_rows,
columns=[np.nan]*no_cols,
index=[np.nan]*no_empty_rows))
NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
You can then append all three together. First I put the columns and data of df back together and reset their index, then append that to df_empty_rows:
df_out = df_empty_rows.append(cols.append(df).reset_index(drop=True))
NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN
0.0 Col1 col2 col3 col4
1.0 One Two Three four
2.0 2 4 5 8
Full code:
no_cols = len(df.columns)
no_empty_rows = 6
cols = pd.DataFrame([df.columns], columns=[np.nan]*no_cols)
df.columns = [np.nan]*no_cols
df_empty_rows = (pd.DataFrame(data=[[np.nan]*no_cols]*no_empty_rows,
columns=[np.nan]*no_cols,
index=[np.nan]*no_empty_rows))
df_out = df_empty_rows.append(cols.append(df).reset_index(drop=True))
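Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the last line can be rewritten with pd.concat, for example:
# pd.concat equivalent of the append chain above (same variables as in the full code)
df_out = pd.concat([df_empty_rows,
                    pd.concat([cols, df]).reset_index(drop=True)])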
I am trying to understand why I am getting NaN for all rows when I extract non-NA values in a specific column. This happens only when I read in the Excel file; it works fine with the CSV.
df = pd.read_excel('q.xlsx', sheet_name=None)
cols = ['Name', 'Age', 'City']
for k, v in df.items():
    if k == "Sheet1":
        mod_cols = v.columns.to_list()
        # The below is to filter on the column that is extra, apart from the ones defined in cols.
        # I am doing this because I have multiple sheets in the excel file, and when I iterate over
        # the entire file I want to filter on that additional column in each of those sheets.
        # For this example, I will focus on the first sheet.
        diff = set(mod_cols) - set(cols)
        # diff is State in this case
        d = v[~v[diff].isna()]
d
Name Age City State
0 NaN NaN NaN NaN
1 NaN NaN NaN NJ
2 NaN NaN NaN NaN
3 NaN NaN NaN NY
4 NaN NaN NaN NaN
5 NaN NaN NaN NC
6 NaN NaN NaN NaN
However, with the CSV it works perfectly:
df = pd.read_csv('q.csv')
d = df[~df['State'].isna()]
d
Name Age City State
1 Joe 31 Newark NJ
3 Mike 32 NYC NY
5 Moe 33 Durham NC
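A likely cause, for what it's worth: diff is a set, so v[diff] selects a one-column DataFrame, and masking v with a boolean DataFrame blanks out every cell not covered by the mask instead of filtering rows, which matches the all-NaN output above. A minimal sketch of a row-wise filter (assuming 'State' is the only extra column) would be:
# take the single extra column out of the set and build a boolean Series,
# so the mask filters rows just like in the CSV case
extra = list(diff)[0]            # e.g. 'State'
d = v[v[extra].notna()]
# or, equivalently, drop rows that are missing in any of the extra columns
d = v.dropna(subset=list(diff))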
I have a data frame as shown below:
ID Type Desc D_N D_A C_N C_A
1 Edu Education 3 100 NaN NaN
1 Bank In_Pay NaN NaN NaN 900
1 Eat Food 4 200 NaN NaN
1 Edu Education NaN NaN NaN NaN
1 Bank NaN NaN NaN 4 700
1 Eat Food NaN NaN NaN NaN
2 Edu Education NaN NaN 1 100
2 Bank NaN NaN NaN 8 NaN
2 NaN Food 4 NaN NaN NaN
3 Edu Education NaN NaN NaN NaN
3 Bank NaN 2 300 NaN NaN
3 Eat Food NaN 140 NaN NaN
From the above df, I would like to filter the rows where exactly one of the columns D_N, D_A, C_N and C_A has a non-missing value.
Expected Output:
ID Type Desc D_N D_A C_N C_A
1 Bank In_Pay NaN NaN NaN 900
2 Bank NaN NaN NaN 8 NaN
2 NaN Food 4 NaN NaN NaN
3 Eat Food NaN 140 NaN NaN
I tried the below code but that does not work.
df[df.loc[:, ["D_N", "D_A", "C_N", "C_A"]].isna().sum(axis=1).eq(1)]
Use DataFrame.count, which counts values while excluding missing values:
df1 = df[df[["D_N", "D_A", "C_N", "C_A"]].count(axis=1).eq(1)]
print (df1)
ID Type Desc D_N D_A C_N C_A
1 1 Bank In_Pay NaN NaN NaN 900.0
7 2 Bank NaN NaN NaN 8.0 NaN
8 2 NaN Food 4.0 NaN NaN NaN
11 3 Eat Food NaN 140.0 NaN NaN
Your solution can be modified by testing for non-missing values instead:
df1 = df[df[["D_N", "D_A", "C_N", "C_A"]].notna().sum(axis=1).eq(1)]
I want to merge the content of certain rows, but only where some specific conditions are met.
Here is the test dataframe I am working on:
Date Desc Debit Credit Bal
0 04-08-2019 abcdef 45654 NaN 345.0
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.06
4 06-08-2019 xyz NaN 350.0 695.06
in which I want to join the rows where Date is NaN to the previous row.
Output required:
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654 NaN 345.06
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.0
4 06-08-2019 xyz NaN 350.0 695.06
Could anybody help me out with this? I have tried the following:
for j in [x for x in range(lst[0], lst[-1]+1) if x not in lst]:
print (test.loc[j-1:j, ].apply(lambda x: ''.join(str(x)), axis=1))
But could not get the expected result.
You can use:
d = df["Date"].fillna(method='ffill')
df.update(df.groupby(d).transform('sum'))
print(df)
Output:
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 0.0 351.0
1 NaN abcdefjklmn 45654.0 0.0 351.0
2 05-08-2019 abd 45.0 0.0 345.0
3 06-08-2019 xyz 0.0 345.0 54645.0
idx = test.loc[test["Date"].isna()].index
test.loc[idx-1, "Desc"] = test.loc[idx-1]["Desc"].str.cat(test.loc[idx]["Desc"])
test.loc[idx-1, "Bal"] = (test.loc[idx-1]["Bal"].astype(str)
                          .str.cat(test.loc[idx]["Bal"].astype(str)))
# I tried to add the two Bal values numerically instead, but it didn't work as expected, giving 351.0
# test.loc[idx-1, "Bal"] = test.loc[idx-1]["Bal"].values + test.loc[idx]["Bal"].values
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 NaN 345.06.0
1 NaN jklmn NaN NaN 6
2 05-08-2019 abd 45.0 NaN 345
3 06-08-2019 xyz NaN 345.0 54645