Update Column based on another column and Delete data from the other - python-3.x

Let's assume the df looks like:
import pandas as pd
import numpy as np

df = pd.DataFrame(data={'fname': ['Anky','Anky','Tom','Harry','Harry','Harry'],
                        'lname': ['sur1','sur1','sur2','sur3','sur3','sur3'],
                        'role': ['','abc','def','ghi','','ijk'],
                        'mobile': ['08511663451212','+4471123456','0851166346','','0851166347',''],
                        'Pmobile': ['085116634512','1234567890','8885116634','','+353051166347','0987654321']})
df.replace('', np.nan, inplace=True)
df:
fname lname role mobile Pmobile
0 Anky sur1 NaN 08511663451212 085116634512
1 Anky sur1 abc +4471123456 1234567890
2 Tom sur2 def 0851166346 8885116634
3 Harry sur3 ghi NaN NaN
4 Harry sur3 NaN 0851166347 +353051166347
5 Harry sur3 ijk NaN 0987654321
So I want to update the column mobile with values from Pmobile where the value starts with '08', '8', or '+353', and at the same time delete the value from Pmobile wherever a match is found and the data is copied to the mobile field.
Presently I am getting this by:
df.mobile.update(df['Pmobile'][df['Pmobile'].str.startswith(('08', '8', '+353'), na=False)])
df.loc[df.mobile == df.Pmobile, 'Pmobile'] = np.nan
df:
fname lname role mobile Pmobile
0 Anky sur1 NaN 085116634512 NaN
1 Anky sur1 abc +4471123456 1234567890
2 Tom sur2 def 8885116634 NaN
3 Harry sur3 ghi NaN NaN
4 Harry sur3 NaN +353051166347 NaN
5 Harry sur3 ijk NaN 0987654321
Is there a way to do this on the fly?
Thanks in advance. :)

You can use shift to shift the columns left to do this:
In[50]:
df.loc[df['Pmobile'].str.startswith(('08', '8', '+353'), na=False),
       ['mobile', 'Pmobile']] = df[['mobile', 'Pmobile']].shift(-1, axis=1)
df
Out[50]:
fname lname role mobile Pmobile
0 Anky sur1 NaN 085116634512 NaN
1 Anky sur1 abc +4471123456 1234567890
2 Tom sur2 def 8885116634 NaN
3 Harry sur3 ghi NaN NaN
4 Harry sur3 NaN +353051166347 NaN
5 Harry sur3 ijk NaN 0987654321
So use your condition to mask the rows of interest, then assign to those two columns the result of shifting them left by one where the condition is met.
This leaves a NaN where the value has shifted and does nothing where the condition isn't met.
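Equivalently, you can keep your original two-step idea but avoid the chained assignment (a sketch, using the same startswith condition and the df/np from the question):

mask = df['Pmobile'].str.startswith(('08', '8', '+353'), na=False)
df.loc[mask, 'mobile'] = df.loc[mask, 'Pmobile']   # copy matching values into mobile
df.loc[mask, 'Pmobile'] = np.nan                   # then clear them from Pmobile

Either way there is no intermediate equality check, so a row where mobile happened to equal Pmobile for some other reason is not accidentally cleared.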

Related

How to read unmerged columns in Pandas and transpose them

I have an Excel file with multiple sheets in the format below. I need to create a single dataframe by concatenating all the sheets, unmerging the cells, and then transposing them into a column based on the sheet name.
Sheet 1: (screenshot)
Sheet 2: (screenshot)
The final dataframe should look like the result below; I need that format with an extra column.
Code so far:
Reading the file:
dfs = pd.read_excel('/Users/john/Desktop/Internal/Raw Files/Med/Dig/File_2017_2022.xlsx', sheet_name=None, skiprows=1)
Creating the column:
df_1 = pd.concat([df.assign(name=n) for n,df in dfs.items()])
Use read_excel with header=[0,1] to build a MultiIndex from the first two header rows and index_col=[0,1] to build a MultiIndex from the first two columns. That makes it possible to reshape each sheet with DataFrame.stack in a loop, add a new column, combine with concat, and finally set the index names with DataFrame.rename_axis and convert them to columns with DataFrame.reset_index:
dfs = pd.read_excel('Input_V1.xlsx', sheet_name=None, header=[0,1], index_col=[0,1])
df_1 = (pd.concat([df.stack(0).assign(name=n) for n, df in dfs.items()])
          .rename_axis(index=['Date','WK','Brand'], columns=None)
          .reset_index())
df_1.insert(len(df_1.columns) - 2, 'Campaign', df_1.pop('Campaign'))
print(df_1)
Date WK Brand A B C D E F G \
0 2017-10-02 Week 40 ABC NaN NaN NaN NaN 56892.800000 83431.664000 NaN
1 2017-10-09 Week 41 ABC NaN NaN NaN NaN 0.713716 0.474025 NaN
2 2017-10-16 Week 42 ABC NaN NaN NaN NaN 0.025936 0.072500 NaN
3 2017-10-23 Week 43 ABC NaN NaN NaN NaN 0.182677 0.926731 NaN
4 2017-10-30 Week 44 ABC NaN NaN NaN NaN 0.755607 0.686115 NaN
.. ... ... ... .. .. .. .. ... ... ..
99 2018-03-26 Week 13 PQR NaN NaN NaN NaN 47702.000000 12246.000000 NaN
100 2018-04-02 Week 14 PQR NaN NaN NaN NaN 38768.000000 46498.000000 NaN
101 2018-04-09 Week 15 PQR NaN NaN NaN NaN 35917.000000 45329.000000 NaN
102 2018-04-16 Week 16 PQR NaN NaN NaN NaN 39639.000000 51343.000000 NaN
103 2018-04-23 Week 17 PQR NaN NaN NaN NaN 50867.000000 30119.000000 NaN
H I J K L Campaign name
0 NaN NaN NaN 0.017888 0.697324 NaN ABC
1 NaN NaN NaN 0.457963 0.810985 NaN ABC
2 NaN NaN NaN 0.743030 0.253668 NaN ABC
3 NaN NaN NaN 0.038683 0.050028 NaN ABC
4 NaN NaN NaN 0.885567 0.712333 NaN ABC
.. .. .. .. ... ... ... ...
99 NaN NaN NaN 9433.000000 17108.000000 WX PQR
100 NaN NaN NaN 12529.000000 23557.000000 WX PQR
101 NaN NaN NaN 20395.000000 44228.000000 WX PQR
102 NaN NaN NaN 55077.000000 45149.000000 WX PQR
103 NaN NaN NaN 45815.000000 35761.000000 WX PQR
[104 rows x 17 columns]
I created my own version of your Excel file, which looks like this (screenshot).
The code below is far from perfect, but it should do fine as long as you do not have millions of sheets:
# First, obtain all sheet names
full_df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx',
                        sheet_name=None, skiprows=0)
# Store them into a list
sheet_names = list(full_df.keys())
# Create an empty DataFrame to store the contents from each sheet
final_df = pd.DataFrame()
for sheet in sheet_names:
    df = pd.read_excel(r'C:\Users\.\Downloads\test.xlsx', sheet_name=sheet, skiprows=0)
    # Get the brand name
    brand = df.columns[1]
    # Remove the header columns and keep the numerical values only
    df.columns = df.iloc[0]
    df = df[1:]
    df = df.iloc[:, 1:]
    # Set the brand name into a new column
    df['Brand'] = brand
    # Append into the final dataframe
    final_df = pd.concat([final_df, df])
Your final_df should look like this once exported back to Excel (screenshot).
EDIT: You might need to drop the dataframe's index when saving, using df.reset_index(drop=True), to remove the index column shown in the image right above.
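For instance, a minimal sketch of the save step (the output filename here is made up):

# Option 1: replace the concatenated per-sheet indices with a clean 0..n range first
final_df.reset_index(drop=True).to_excel(r'C:\Users\.\Downloads\final.xlsx')
# Option 2: skip writing the index column entirely
final_df.to_excel(r'C:\Users\.\Downloads\final.xlsx', index=False)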

Extract rows when a column value is not na in pandas dataframe

I am trying to understand why I am getting NaN for all rows when I extract the non-NA values in a specific column. This happens only when I read in the Excel file; it works fine with the CSV.
df = pd.read_excel('q.xlsx', sheet_name=None)
cols = ['Name', 'Age', 'City']
for k, v in df.items():
    if k == "Sheet1":
        mod_cols = v.columns.to_list()
        # The below is to filter on the column that is extra, apart from the ones defined in cols.
        # The reason I am doing this is that I have multiple sheets in the Excel file, and when I
        # iterate over the entire file, I want to filter on that additional column in each of
        # those sheets. For this example, we will focus on the first sheet.
        diff = set(mod_cols) - set(cols)
        # diff is {'State'} in this case
        d = v[~v[diff].isna()]
d
Name Age City State
0 NaN NaN NaN NaN
1 NaN NaN NaN NJ
2 NaN NaN NaN NaN
3 NaN NaN NaN NY
4 NaN NaN NaN NaN
5 NaN NaN NaN NC
6 NaN NaN NaN NaN
However, with the CSV it returns perfectly:
df=pd.read_csv('q.csv')
d=df[~df['State'].isna()]
d
Name Age City State
1 Joe 31 Newark NJ
3 Mike 32 NYC NY
5 Moe 33 Durham NC
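For what it's worth, the all-NaN frame is the classic symptom of indexing a DataFrame with a boolean DataFrame instead of a boolean Series: diff is a set, so v[diff] selects a one-column DataFrame (newer pandas versions reject a set here and ask for a list), and v[~v[diff].isna()] then masks every cell outside that column to NaN. A sketch of the likely fix, assuming diff holds the single extra column ('State' here):

col = next(iter(diff))          # the one extra column, e.g. 'State'
d = v[v[col].notna()]           # boolean Series -> row filter, as in the CSV case
# or, for any number of extra columns:
d = v.dropna(subset=list(diff))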

INDEX and MATCH with multiple criteria in Pandas Python

I am trying to do an index match across 2 data sets but having trouble. Here is an example of what I am trying to do: I want to fill in columns "a", "b", "c" that are empty in df with the df2 data where "Machine" and "Year" match, using "Order Type" to pick the column.
The first dataframe, let's call this one "df":
Machine Year Cost a b c
0 abc 2014 5500 nan nan nan
1 abc 2015 89 nan nan nan
2 abc 2016 600 nan nan nan
3 abc 2017 250 nan nan nan
4 abc 2018 2100 nan nan nan
5 abc 2019 590 nan nan nan
6 dcb 2020 3000 nan nan nan
7 dcb 2021 100 nan nan nan
The second data set is called "df2":
Order Type Machine Year Total Count
0 a abc 2014 1
1 b abc 2014 1
2 c abc 2014 2
4 c dcb 2015 4
3 a abc 2016 3
Final Output is:
Machine Year Cost a b c
0 abc 2014 5500 1 1 2
1 abc 2015 89 nan nan nan
2 abc 2016 600 3 nan nan
3 abc 2017 250 nan nan nan
4 abc 2018 2100 nan nan nan
5 abc 2019 590 1 nan nan
6 dcb 2014 3000 nan nan 4
7 dcb 2015 100 nan nan nan
Thanks for the help in advance.
Consider DataFrame.pivot to reshape df2, then merge with df1:
final_df = (
    df1.reindex(["Machine", "Year", "Cost"], axis=1)
       .merge(
           df2.pivot(
               index=["Machine", "Year"],   # list-valued index needs pandas >= 1.1
               columns="Order Type",
               values="Total Count"
           ).reset_index(),
           on=["Machine", "Year"],
           how="left"
       )
)
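Note that merge defaults to an inner join, so how="left" above is what keeps the df1 rows with no match as NaN, as in the final output. Also, DataFrame.pivot raises on duplicate ("Machine", "Year", "Order Type") combinations; a sketch using pivot_table instead, under the assumption that summing repeated counts is acceptable:

counts = df2.pivot_table(index=["Machine", "Year"],
                         columns="Order Type",
                         values="Total Count",
                         aggfunc="sum").reset_index()
final_df = df1[["Machine", "Year", "Cost"]].merge(counts, on=["Machine", "Year"], how="left")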

insert rows with next 10 business days to dataframe

df has three columns: date, name, and qty. For each name and date combination I want to insert n rows such that name is repeated in these next n rows but date is increased by 1 business day each time, with qty=nan, if that name and date combination doesn't already exist in df.
>>> import pandas as pd
>>> from datetime import datetime
>>> df = pd.DataFrame({'name':['abd']*3 + ['pqr']*2 + ['xyz']*1, 'date':[datetime(2020,1,6), datetime(2020,1,8), datetime(2020,2,5), datetime(2017,10,4), datetime(2017,10,13), datetime(2013,5,27)], 'qty':range(6)})
>>> df
name date qty
0 abd 2020-01-06 0
1 abd 2020-01-08 1
2 abd 2020-02-05 2
3 pqr 2017-10-04 3
4 pqr 2017-10-13 4
5 xyz 2013-05-27 5
I am not sure how to go about it. Any thoughts/clues? Thanks a lot!
Desired output for n=3:
name date qty
0 abd 2020-01-06 0
1 abd 2020-01-07 nan
2 abd 2020-01-08 1
3 abd 2020-01-09 nan
4 abd 2020-01-10 nan
5 abd 2020-01-13 nan
6 abd 2020-02-05 2
7 abd 2020-02-06 nan
8 abd 2020-02-07 nan
9 abd 2020-02-10 nan
10 pqr 2017-10-04 3
11 pqr 2017-10-05 nan
12 pqr 2017-10-06 nan
13 pqr 2017-10-09 nan
14 pqr 2017-10-13 4
15 pqr 2017-10-16 nan
16 pqr 2017-10-17 nan
17 pqr 2017-10-18 nan
18 xyz 2013-05-27 5
19 xyz 2013-05-28 nan
20 xyz 2013-05-29 nan
21 xyz 2013-05-30 nan
Here is a way:
from functools import reduce
n = 3
new_index = (
df.groupby("name")
.apply(
lambda x: reduce(
lambda i, j: i.union(j),
[pd.bdate_range(i, periods=n + 1) for i in x["date"]],
)
)
.explode()
)
midx = pd.MultiIndex.from_frame(new_index.reset_index(), names=["name", "date"])
df_out = df.set_index(["name", "date"]).reindex(midx).reset_index()
df_out
If explode cannot be used:
from functools import reduce
n = 3
new_index = (
df.groupby("name")
.apply(
lambda x: reduce(
lambda i, j: i.union(j),
[pd.bdate_range(i, periods=n + 1) for i in x["date"]],
)
)
.apply(pd.Series)
.stack()
.reset_index(level=0)
.rename(columns={0:'date'})
)
df_out = new_index.merge(df, how='left', on=['name', 'date'])
df_out
Output:
name date qty
0 abd 2020-01-06 0.0
1 abd 2020-01-07 NaN
2 abd 2020-01-08 1.0
3 abd 2020-01-09 NaN
4 abd 2020-01-10 NaN
5 abd 2020-01-13 NaN
6 abd 2020-02-05 2.0
7 abd 2020-02-06 NaN
8 abd 2020-02-07 NaN
9 abd 2020-02-10 NaN
10 pqr 2017-10-04 3.0
11 pqr 2017-10-05 NaN
12 pqr 2017-10-06 NaN
13 pqr 2017-10-09 NaN
14 pqr 2017-10-13 4.0
15 pqr 2017-10-16 NaN
16 pqr 2017-10-17 NaN
17 pqr 2017-10-18 NaN
18 xyz 2013-05-27 5.0
19 xyz 2013-05-28 NaN
20 xyz 2013-05-29 NaN
21 xyz 2013-05-30 NaN
How it works:
First, import reduce from functools and use it with pd.Index.union to combine the per-row date ranges into a single index of dates. Each range comes from pd.bdate_range, built inside a groupby for each name. Then convert new_index (with its name level) to a MultiIndex using pd.MultiIndex.from_frame, and finally use reindex after set_index on the original dataframe.
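A more compact variant of the same idea (my own sketch, not the code above): build each row's business-day window as a list, explode it, de-duplicate overlapping windows per name, and left-merge qty back on:

import pandas as pd

n = 3
windows = df.assign(date=[list(pd.bdate_range(d, periods=n + 1)) for d in df['date']])
expanded = windows.explode('date')[['name', 'date']].drop_duplicates()
expanded['date'] = pd.to_datetime(expanded['date'])   # explode leaves object dtype
df_out = (expanded.merge(df, on=['name', 'date'], how='left')
                  .sort_values(['name', 'date'])
                  .reset_index(drop=True))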

Combine text from multiple rows in pandas

I want to merge the content of certain rows, but only where some specific conditions are met.
Here is the test dataframe I am working on:
Date Desc Debit Credit Bal
0 04-08-2019 abcdef 45654 NaN 345.0
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.06
4 06-08-2019 xyz NaN 350.0 695.06
in which I want to join the rows where Date is NaN onto the previous row.
Output required:
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654 NaN 345.06
1 NaN jklmn NaN NaN 6
2 04-08-2019 pqr NaN 23 368.06
3 05-08-2019 abd 23 NaN 345.0
4 06-08-2019 xyz NaN 350.0 695.06
Could anybody help me out with this? I have tried the following:
for j in [x for x in range(lst[0], lst[-1]+1) if x not in lst]:
    print(test.loc[j-1:j, ].apply(lambda x: ''.join(str(x)), axis=1))
But could not get the expected result.
You can use:
d = df["Date"].fillna(method='ffill')
df.update(df.groupby(d).transform('sum'))
print(df)
Output:
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 0.0 351.0
1 NaN abcdefjklmn 45654.0 0.0 351.0
2 05-08-2019 abd 45.0 0.0 345.0
3 06-08-2019 xyz 0.0 345.0 54645.0
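(Side note on the snippet above: in recent pandas, fillna(method='ffill') is deprecated; Series.ffill is the same forward fill, so an equivalent sketch is:)

d = df["Date"].ffill()                       # forward-fill dates to label each group
df.update(df.groupby(d).transform('sum'))    # 'sum' concatenates strings and adds numbers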
Alternatively:
idx = test.loc[test["Date"].isna()].index
test.loc[idx-1, "Desc"] = test.loc[idx-1]["Desc"].str.cat(test.loc[idx]["Desc"].values)  # .values avoids index alignment in str.cat
test.loc[idx-1, "Bal"] = (test.loc[idx-1]["Bal"].astype(str)
                              .str.cat(test.loc[idx]["Bal"].astype(str).values))
## I tried to add the two values, but it didn't work as expected, giving 351.0
# test.loc[idx-1, "Bal"] = test.loc[idx-1]["Bal"].values + test.loc[idx]["Bal"].values
Date Desc Debit Credit Bal
0 04-08-2019 abcdefjklmn 45654.0 NaN 345.06.0
1 NaN jklmn NaN NaN 6
2 05-08-2019 abd 45.0 NaN 345
3 06-08-2019 xyz NaN 345.0 54645
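A different sketch of my own, assuming only Desc needs merging into the previous row: label each NaN-date row with the preceding row's group via a cumulative count of non-null dates, then concatenate Desc per group:

grp = df["Date"].notna().cumsum()                # NaN-date rows share the previous row's group
out = df.assign(Desc=df.groupby(grp)["Desc"].transform("".join))
# optionally drop the continuation rows afterwards:
# out = out[df["Date"].notna()].reset_index(drop=True)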
