Pandas delete and shift cells in a column based on multiple conditions - python-3.x

I have a situation where I want to delete and shift cells in a pandas data frame based on some conditions. My data frame looks like this:
Value_1 ID_1 Value_2 ID_2 Value_3 ID_3
A 1 D 1 G 1
B 1 E 2 H 1
C 1 F 2 I 3
C 1 F 2 H 1
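For reproducibility, a sketch that builds this example frame (dtypes assumed from the values shown):
import pandas as pd

df = pd.DataFrame({
    "Value_1": ["A", "B", "C", "C"],
    "ID_1":    [1, 1, 1, 1],
    "Value_2": ["D", "E", "F", "F"],
    "ID_2":    [1, 2, 2, 2],
    "Value_3": ["G", "H", "I", "H"],
    "ID_3":    [1, 1, 3, 1],
})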
Now I want to apply the following condition:
ID_2 and ID_3 should always be less than or equal to ID_1. If either of them is greater than ID_1, that cell should be deleted and the value from the next column shifted into its place.
The output should look like the following:
Value_1 ID_1 Value_2 ID_2 Value_3 ID_3
A 1 D 1 G 1
B 1 H 1 blank nan
C 1 blank nan blank nan
C 1 H 1 blank nan

You can create a mask from the condition, here checking which values are greater than ID_1 with DataFrame.gt:
cols1 = ['Value_2','Value_3']
cols2 = ['ID_2','ID_3']
m = df[cols2].gt(df['ID_1'], axis=0)
print (m)
ID_2 ID_3
0 False False
1 True False
2 True True
3 True False
Then set values to missing wherever the mask matches, using DataFrame.mask:
df[cols2] = df[cols2].mask(m)
df[cols1] = df[cols1].mask(m.to_numpy())
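At this point the invalid cells are already missing but nothing has shifted yet; printing df here (not shown in the original answer) should give roughly:
print (df)
Value_1 ID_1 Value_2 ID_2 Value_3 ID_3
0 A 1 D 1.0 G 1.0
1 B 1 NaN NaN H 1.0
2 C 1 NaN NaN NaN NaN
3 C 1 NaN NaN H 1.0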
Finally, use DataFrame.shift to move the surviving values one column to the left and assign the new columns with Series.mask:
# shift the ID columns left, so the value from ID_3 can fill an invalid ID_2
df1 = df[cols2].shift(-1, axis=1)
df['ID_2'] = df['ID_2'].mask(m['ID_2'], df1['ID_2'])
# ID_3 has been moved into ID_2 on those rows, so blank it out there
df['ID_3'] = df['ID_3'].mask(m['ID_2'])
# same idea for the Value columns
df2 = df[cols1].shift(-1, axis=1)
df['Value_2'] = df['Value_2'].mask(m['ID_2'], df2['Value_2'])
df['Value_3'] = df['Value_3'].mask(m['ID_2'])
print (df)
Value_1 ID_1 Value_2 ID_2 Value_3 ID_3
0 A 1 D 1.0 G 1.0
1 B 1 H 1.0 NaN NaN
2 C 1 NaN NaN NaN NaN
3 C 1 H 1.0 NaN NaN
Finally, if necessary, replace the missing values in the Value columns with empty strings:
df[cols1] = df[cols1].fillna('')
print (df)
Value_1 ID_1 Value_2 ID_2 Value_3 ID_3
0 A 1 D 1.0 G 1.0
1 B 1 H 1.0 NaN
2 C 1 NaN NaN
3 C 1 H 1.0 NaN

Related

How to delete rows with the same value? Merge columns with the same prefix

Hi everyone, I have two questions I need help with.
Question 2
I have a df with data as below:
ABC_x  Quantity silent  ABC_y  Quantity noirse
A      05               NaN    NaN
B      03               NaN    NaN
NaN    NaN              D      08
NaN    NaN              E      09
G      01               NaN    NaN
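For reference, a sketch that builds this frame (column names as in the post; the quantities are assumed to be strings so the leading zeros survive):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ABC_x": ["A", "B", np.nan, np.nan, "G"],
    "Quantity silent": ["05", "03", np.nan, np.nan, "01"],
    "ABC_y": [np.nan, np.nan, "D", "E", np.nan],
    "Quantity noirse": [np.nan, np.nan, "08", "09", np.nan],
})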
How do I merge the two columns ABC_x and ABC_y (same prefix ABC) into one column ABC, and likewise merge the two quantity columns into one column Quantity?
DF expected:
ABC Quantity
A   05
B   03
D   08
E   09
G   01
Thank you for reading and helping me troubleshoot this problem. Have a nice day <3
I have tried but was unsuccessful.
Question 1
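The frame for Question 1 is not shown in the thread; a frame consistent with the outputs below would be (an assumption, for reproducibility):
import pandas as pd

df = pd.DataFrame({
    "Name": ["NameA", "NameA"],
    "Column A": ["ValueA", "ValueA"],
    "Column B": ["ValueB", "ValueB"],
    "Column C": ["ValueC", "ValueC"],
    "Column D": ["Value_D001", "Value_D002"],
    "Column E": ["Value_E01", "Value_E06"],
    "Column F": ["Value_F3", "Value_F4"],
})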
pandas has a function duplicated that gives you True for duplicates and False otherwise:
In [40]: df.duplicated(["Column A"])
Out[40]:
0 False
1 True
dtype: bool
You can use this for boolean indexing
In [43]: df.loc[df.duplicated(["Column A"]), "Column A"] = np.nan
In [44]: df
Out[44]:
Name Column A Column B Column C Column D Column E Column F
0 NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NameA NaN ValueB ValueC Value_D002 Value_E06 Value_F4
and the same for the other columns.
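A short sketch of that per-column version in a loop (column names assumed from the example above):
import numpy as np

for col in ["Column A", "Column B", "Column C"]:
    # blank out values that repeat an earlier value in that column
    df.loc[df.duplicated([col]), col] = np.nan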
Note
You can also pass multiple columns with
In [52]: df.loc[
...: df.duplicated(["Column A", "Column B", "Column C"]),
...: ["Column A", "Column B", "Column C"],
...: ] = np.nan
In [53]: df
Out[53]:
Name Column A Column B Column C Column D Column E Column F
0 NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NameA NaN NaN NaN Value_D002 Value_E06 Value_F4
However, this would replace only where all three columns are duplicated at the same time.
Question 2
pandas has a function fillna to replace NaN values. From your example I assume there is a value in either _x or _y. In this case you can backfill along the columns to keep _x if it is there and take _y otherwise:
In [76]: df[["ABC_x", "ABC_y"]].fillna(method="backfill", axis=1)
Out[76]:
ABC_x ABC_y
0 A NaN
1 B NaN
2 D D
3 E E
4 G NaN
Then do this for ABC as well as Quantity and use the first column only:
In [82]: pd.DataFrame({
"ABC": df[["ABC_x", "ABC_y"]].fillna(method="backfill", axis=1).iloc[:, 0],
"Quantity": df[["Quantity silent", "Quantity noirse"]].fillna(method="backfill", axis=1).iloc[:, 0].astype(int),
})
Out[82]:
ABC Quantity
0 A 5
1 B 3
2 D 8
3 E 9
4 G 1
The astype(int) at the end is needed because NaN is not a valid integer, so pandas interprets the numbers as floats in the presence of NaN.
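As an aside (not part of the original answer), pandas' nullable integer dtype can hold missing values, so a float column with NaN can be converted to integers without dropping the missing entries; a minimal sketch:
s = pd.Series([5.0, 3.0, None, 8.0])   # hypothetical float Series with a missing value
s.astype("Int64")                      # capital-I Int64 keeps the missing entry as <NA>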
Question 1
Where a column name contains 'Column', mask duplicated values to NaN:
cond1 = df.columns.str.contains('Column')
df.loc[:, cond1].apply(lambda x: x.mask(x.duplicated()))
result:
Column A Column B Column C Column D Column E Column F
0 ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NaN NaN NaN Value_D002 Value_E06 Value_F4
Then join the result back to the remaining columns (here, Name).
Full code:
cond1 = df.columns.str.contains('Column')
df.loc[:, ~cond1].join(df.loc[:, cond1].apply(lambda x: x.mask(x.duplicated())))
Name Column A Column B Column C Column D Column E Column F
0 NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NameA NaN NaN NaN Value_D002 Value_E06 Value_F4
Question 2
Rename the columns to their common prefix with set_axis, then take the first non-null value per prefix with a column-wise groupby:
df.set_axis(df.columns.str.split('[ _]').str[0], axis=1).groupby(level=0, axis=1).first()
Result:
ABC Quantity
0 A 05
1 B 03
2 D 08
3 E 09
4 G 01
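Note that groupby(..., axis=1) is deprecated in recent pandas releases; a rough equivalent via a double transpose (a sketch, assuming the same column layout) is:
prefixes = df.columns.str.split('[ _]').str[0]
# transpose, group the former columns by prefix, take the first non-null value, transpose back
df.set_axis(prefixes, axis=1).T.groupby(level=0).first().T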

calculate difference between consecutive date records at an ID level

I have a dataframe as
col 1 col 2
A 2020-07-13
A 2020-07-15
A 2020-07-18
A 2020-07-19
B 2020-07-13
B 2020-07-19
C 2020-07-13
C 2020-07-18
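For reproducibility, a sketch that builds this frame (dates assumed to be strings, since the answers below parse them with to_datetime):
import pandas as pd

df = pd.DataFrame({
    "col 1": ["A", "A", "A", "A", "B", "B", "C", "C"],
    "col 2": ["2020-07-13", "2020-07-15", "2020-07-18", "2020-07-19",
              "2020-07-13", "2020-07-19", "2020-07-13", "2020-07-18"],
})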
I want it to become the following in a new dataframe
col_3 diff_btw_1st_2nd_date diff_btw_2nd_3rd_date diff_btw_3rd_4th_date
A 2 3 1
B 6 NaN NaN
C 5 NaN NaN
I tried a groupby at the col 1 level, but I am not getting the intended result. Can anyone help?
Use GroupBy.cumcount for a counter per group in col 1 and reshape with DataFrame.set_index plus Series.unstack, then use DataFrame.diff, remove the first all-NaN column with DataFrame.iloc, convert the timedeltas to days with Series.dt.days in each column, and rename the columns with DataFrame.add_prefix:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1',df.groupby('col 1').cumcount()])['col 2']
.unstack()
.diff(axis=1)
.iloc[:, 1:]
.apply(lambda x: x.dt.days)
.add_prefix('diff_')
.reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2 3.0 1.0
1 B 6 NaN NaN
2 C 5 NaN NaN
Or use DataFrameGroupBy.diff with a counter, add both as new columns with DataFrame.assign, drop the NaN rows in c1 with DataFrame.dropna, and reshape with DataFrame.pivot:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.assign(g = df.groupby('col 1').cumcount(),
c1 = df.groupby('col 1')['col 2'].diff().dt.days)
.dropna(subset=['c1'])
.pivot(index='col 1', columns='g', values='c1')
.add_prefix('diff_')
.rename_axis(None, axis=1)
.reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2.0 3.0 1.0
1 B 6.0 NaN NaN
2 C 5.0 NaN NaN
You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. partial transpose, discarding the first difference, which is NaN
df2 = df[["col 1", "diff", "cumcount"]]\
.pivot(index="col 1", columns="cumcount")\
.drop(columns=[("diff", 0)])
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
d2-d1 d3-d2 d4-d3
col 1
A 2.0 3.0 1.0
B 6.0 NaN NaN
C 5.0 NaN NaN
df after assigning cumcount looks like this:
print(df)
col 1 col 2 diff cumcount
0 A 2020-07-13 NaN 0
1 A 2020-07-15 2.0 1
2 A 2020-07-18 3.0 2
3 A 2020-07-19 1.0 3
4 B 2020-07-13 NaN 0
5 B 2020-07-19 6.0 1
6 C 2020-07-13 NaN 0
7 C 2020-07-18 5.0 1

pandas groupby and widen dataframe with ordered columns

I have a long form dataframe that contains multiple samples and time points for each subject. The number of samples and timepoint can vary, and the days between time points can also vary:
test_df = pd.DataFrame({"subject_id":[1,1,1,2,2,3],
"sample":["A", "B", "C", "D", "E", "F"],
"timepoint":[19,11,8,6,2,12],
"time_order":[3,2,1,2,1,1]
})
subject_id sample timepoint time_order
0 1 A 19 3
1 1 B 11 2
2 1 C 8 1
3 2 D 6 2
4 2 E 2 1
5 3 F 12 1
I need to figure out a way to generalize grouping this dataframe by subject_id and putting all samples and time points on the same row, in time order.
DESIRED OUTPUT:
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8 B 11 A 19
1 2 E 2 D 6 null null
5 3 F 12 null null null null
Pivot gets me close, but I'm stuck on how to proceed from there:
test_df = test_df.pivot(index=['subject_id', 'sample'],
columns='time_order', values='timepoint')
Use DataFrame.set_index with DataFrame.unstack for pivoting, sort the MultiIndex in the columns, flatten it, and finally convert subject_id back to a column:
df = (test_df.set_index(['subject_id', 'time_order'])
.unstack()
.sort_index(level=[1,0], axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print (df)
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8.0 B 11.0 A 19.0
1 2 E 2.0 D 6.0 NaN NaN
2 3 F 12.0 NaN NaN NaN NaN
An alternative is to take the last, second-to-last and third-to-last rows per group (this relies on the rows already being ordered from latest to earliest within each subject) and concatenate them side by side:
a = test_df.iloc[:, :3].groupby('subject_id').last().add_suffix('1')
b = test_df.iloc[:, :3].groupby('subject_id').nth(-2).add_suffix('2')
c = test_df.iloc[:, :3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b, c], axis=1)
sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
subject_id
1 C 8 B 11.0 A 19.0
2 E 2 D 6.0 NaN NaN
3 F 12 NaN NaN NaN NaN

MELT: multiple values without duplication

Can't be this hard. I have
df=pd.DataFrame({'id':[1,2,3],'name':['j','l','m'], 'mnt':['f','p','p'],'nt':['b','w','e'],'cost':[20,30,80],'paid':[12,23,45]})
I need
import numpy as np
df1=pd.DataFrame({'id':[1,2,3,1,2,3],'name':['j','l','m','j','l','m'], 't':['f','p','p','b','w','e'],'paid':[12,23,45,np.nan,np.nan,np.nan],'cost':[20,30,80,np.nan,np.nan,np.nan]})
I have 45 columns to invert.
I tried
(df.set_index(['id', 'name'])
.rename_axis(['paid'], axis=1)
.stack().reset_index())
EDIT: I think the simplest approach here is DataFrame.melt, then set missing values based on the variable column:
df2 = df.melt(['id', 'name','cost','paid'], value_name='t')
df2.loc[df2.pop('variable').eq('nt'), ['cost','paid']] = np.nan
print (df2)
id name cost paid t
0 1 j 20.0 12.0 f
1 2 l 30.0 23.0 p
2 3 m 80.0 45.0 p
3 1 j NaN NaN b
4 2 l NaN NaN w
5 3 m NaN NaN e
Use lreshape, which takes a dictionary of lists specifying which columns are 'grouped' together:
df2 = pd.lreshape(df, {'t':['mnt','nt'], 'mon':['cost','paid']})
print (df2)
id name t mon
0 1 j f 20
1 2 l p 30
2 3 m p 80
3 1 j b 12
4 2 l w 23
5 3 m e 45
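For reference, the same pairing can be written with plain pd.concat (a sketch; it pairs mnt with cost and nt with paid, exactly as the lreshape dictionary above does):
df2 = pd.concat([
    df[['id', 'name', 'mnt', 'cost']].rename(columns={'mnt': 't', 'cost': 'mon'}),
    df[['id', 'name', 'nt', 'paid']].rename(columns={'nt': 't', 'paid': 'mon'}),
], ignore_index=True)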

Pandas print missing value column names and count only

I am using the following code to print the missing value count and the column names.
#Looking for missing data and then handling it accordingly
def find_missing(data):
    # number of missing values per column
    count_missing = data.isnull().sum().values
    # total records
    total = data.shape[0]
    # percentage of missing
    ratio_missing = count_missing / total
    # return a dataframe showing feature name, # of missing and % of missing
    return pd.DataFrame(data={'missing_count': count_missing, 'missing_ratio': ratio_missing},
                        index=data.columns.values)

find_missing(data_final).head(5)
What I want to do is to only print those columns where there is a missing value as I have a huge data set of about 150 columns.
The data set looks like this
A B C D
123 ABC X Y
123 ABC X Y
NaN ABC NaN NaN
123 ABC NaN NaN
245 ABC NaN NaN
345 ABC NaN NaN
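For reference, a frame consistent with this example (a sketch; dtypes assumed):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [123, 123, np.nan, 123, 245, 345],
    "B": ["ABC"] * 6,
    "C": ["X", "X", np.nan, np.nan, np.nan, np.nan],
    "D": ["Y", "Y", np.nan, np.nan, np.nan, np.nan],
})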
In the output I would just want to see:
missing_count missing_ratio
C 4 0.66
D 4 0.66
and not the columns A and B as there are no missing values there
Use DataFrame.isna with DataFrame.sum to count missing values per column. We can also use DataFrame.isnull instead of DataFrame.isna.
new_df = (df.isna()
.sum()
.to_frame('missing_count')
.assign(missing_ratio = lambda x: x['missing_count']/len(df))
.loc[df.isna().any()] )
print(new_df)
We can also use pd.concat instead of DataFrame.assign:
count = df.isna().sum()
new_df = (pd.concat([count.rename('missing_count'),
count.div(len(df))
.rename('missing_ratio')],axis = 1)
.loc[count.ne(0)])
Output
missing_count missing_ratio
A 1 0.166667
C 4 0.666667
D 4 0.666667
IIUC, we can assign the missing and total counts to two variables, do some basic math, and assign the result back to a df.
a = df.isnull().sum(axis=0)
b = np.round(df.isnull().sum(axis=0) / df.fillna(0).count(axis=0),2)
missing_df = pd.DataFrame({'missing_vals' : a,
'missing_ratio' : b})
print(missing_df)
missing_vals missing_ratio
A 1 0.17
B 0 0.00
C 4 0.67
D 4 0.67
you can filter out columns that don't have any missing vals
missing_df = missing_df[missing_df.missing_vals.ne(0)]
print(missing_df)
missing_vals missing_ratio
A 1 0.17
C 4 0.67
D 4 0.67
You can also use concat:
s = df.isnull().sum()
result = pd.concat([s, s/len(df)], axis=1)
result.columns = ["missing_count","missing_ratio"]
print (result)
missing_count missing_ratio
A 1 0.166667
B 0 0.000000
C 4 0.666667
D 4 0.666667
