How to delete rows with the same value? Merge columns with the same prefix - python-3.x

Hi everyone, I have two questions I need help with.
Question 2
I have a df with data as below:
ABC_x  Quantity silent  ABC_y  Quantity noirse
    A               05    NaN              NaN
    B               03    NaN              NaN
  NaN              NaN      D               08
  NaN              NaN      E               09
    G               01    NaN              NaN
How can I merge the two columns ABC_x and ABC_y (same prefix ABC) into one column ABC, and likewise merge the two Quantity columns into one column Quantity?
DF expected:
ABC  Quantity
  A        05
  B        03
  D        08
  E        09
  G        01
Thank you for reading and helping me troubleshoot this problem. Have a nice day <3
I have tried but was unsuccessful.

Question 1
pandas has a method duplicated that returns True for duplicates and False otherwise:
In [40]: df.duplicated(["Column A"])
Out[40]:
0 False
1 True
dtype: bool
You can use this for boolean indexing
In [43]: df.loc[df.duplicated(["Column A"]), "Column A"] = np.nan
In [44]: df
Out[44]:
Name Column A Column B Column C Column D Column E Column F
0 NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NameA NaN ValueB ValueC Value_D002 Value_E06 Value_F4
and the same for the other columns.
Note
You can also pass multiple columns with
In [52]: df.loc[
...: df.duplicated(["Column A", "Column B", "Column C"]),
...: ["Column A", "Column B", "Column C"],
...: ] = np.nan
In [53]: df
Out[53]:
Name Column A Column B Column C Column D Column E Column F
0 NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NameA NaN NaN NaN Value_D002 Value_E06 Value_F4
However, this would replace only where all three columns are duplicated at the same time.
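If you instead want to blank duplicates in each column independently, a minimal sketch (assuming the same df and column names as above) is to mask each column on its own:
import numpy as np

# Mask duplicates separately per column, so a value is blanked whenever it
# repeats within its own column, regardless of the other columns.
for col in ["Column A", "Column B", "Column C"]:
    df.loc[df.duplicated([col]), col] = np.nan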
Question 2
pandas has a method fillna to replace NaN values. From your example I assume there is a value in either _x or _y. In this case you can use backfill so that _x is used if it is present and _y is taken otherwise:
In [76]: df[["ABC_x", "ABC_y"]].fillna(method="backfill", axis=1)
Out[76]:
ABC_x ABC_y
0 A NaN
1 B NaN
2 D D
3 E E
4 G NaN
Then do this for ABC as well as Quantity and use the first column only:
In [82]: pd.DataFrame({
   ...:     "ABC": df[["ABC_x", "ABC_y"]].fillna(method="backfill", axis=1).iloc[:, 0],
   ...:     "Quantity": df[["Quantity silent", "Quantity noirse"]].fillna(method="backfill", axis=1).iloc[:, 0].astype(int),
   ...: })
Out[82]:
ABC Quantity
0 A 5
1 B 3
2 D 8
3 E 9
4 G 1
The astype(int) at the end is needed because NaN is not a valid integer, so pandas interprets the numbers as floats whenever NaN is present.
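For reference, a minimal self-contained sketch of this approach, assuming the sample frame from the question (DataFrame.bfill is equivalent to fillna(method="backfill")):
import numpy as np
import pandas as pd

# Sample data from the question (column names kept exactly as posted).
df = pd.DataFrame({
    "ABC_x": ["A", "B", np.nan, np.nan, "G"],
    "Quantity silent": ["05", "03", np.nan, np.nan, "01"],
    "ABC_y": [np.nan, np.nan, "D", "E", np.nan],
    "Quantity noirse": [np.nan, np.nan, "08", "09", np.nan],
})

# Backfill along the columns so each _x column picks up the _y value when it
# is missing, then keep only the first column of each pair.
merged = pd.DataFrame({
    "ABC": df[["ABC_x", "ABC_y"]].bfill(axis=1).iloc[:, 0],
    "Quantity": df[["Quantity silent", "Quantity noirse"]].bfill(axis=1).iloc[:, 0],
})
print(merged)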

Question 1
Where the column name contains 'Column', mask duplicated values with NaN:
cond1 = df.columns.str.contains('Column')
df.loc[:, cond1].apply(lambda x: x.mask(x.duplicated()))
result:
Column A Column B Column C Column D Column E Column F
0 ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NaN NaN NaN Value_D002 Value_E06 Value_F4
Then join the result back to the Name column.
Full code:
cond1 = df.columns.str.contains('Column')
df.loc[:, ~cond1].join(df.loc[:, cond1].apply(lambda x: x.mask(x.duplicated())))
Name Column A Column B Column C Column D Column E Column F
0 NameA ValueA ValueB ValueC Value_D001 Value_E01 Value_F3
1 NameA NaN NaN NaN Value_D002 Value_E06 Value_F4
Question 2
df.set_axis(df.columns.str.split('[ _]').str[0], axis=1).groupby(level=0, axis=1).first()
result
ABC Quantity
0 A 05
1 B 03
2 D 08
3 E 09
4 G 01
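A minimal sketch of the same idea without groupby(axis=1) (which newer pandas versions deprecate), reusing the prefix split from above and assuming the same df:
# Group the column labels by their prefix (the text before the first space or
# underscore) and take the first non-null value across each group of columns.
prefixes = df.columns.str.split('[ _]', regex=True).str[0]
merged = pd.DataFrame({
    p: df.loc[:, prefixes == p].bfill(axis=1).iloc[:, 0]
    for p in pd.unique(prefixes)
})
print(merged)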

Related

Calculate difference between consecutive date records at an ID level

I have a dataframe like this:
col 1 col 2
A 2020-07-13
A 2020-07-15
A 2020-07-18
A 2020-07-19
B 2020-07-13
B 2020-07-19
C 2020-07-13
C 2020-07-18
I want it to become the following in a new dataframe
col_3 diff_btw_1st_2nd_date diff_btw_2nd_3rd_date diff_btw_3rd_4th_date
A 2 3 1
B 6 NaN NaN
C 5 NaN NaN
I tried doing the groupby at the col 1 level, but am not getting the intended result. Can anyone help?
Use GroupBy.cumcount as a counter per group of col 1 and reshape with DataFrame.set_index plus Series.unstack, then use DataFrame.diff, remove the first all-NaN column with DataFrame.iloc, convert the timedeltas to days with Series.dt.days across all columns, and rename the columns with DataFrame.add_prefix:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.set_index(['col 1', df.groupby('col 1').cumcount()])['col 2']
        .unstack()
        .diff(axis=1)
        .iloc[:, 1:]
        .apply(lambda x: x.dt.days)
        .add_prefix('diff_')
        .reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2 3.0 1.0
1 B 6 NaN NaN
2 C 5 NaN NaN
Or use DataFrameGroupBy.diff with a counter for the new columns via DataFrame.assign, reshape with DataFrame.pivot, and remove NaNs in c1 with DataFrame.dropna:
df['col 2'] = pd.to_datetime(df['col 2'])
df = (df.assign(g = df.groupby('col 1').cumcount(),
                c1 = df.groupby('col 1')['col 2'].diff().dt.days)
        .dropna(subset=['c1'])
        .pivot('col 1', 'g', 'c1')
        .add_prefix('diff_')
        .rename_axis(None, axis=1)
        .reset_index())
print (df)
col 1 diff_1 diff_2 diff_3
0 A 2.0 3.0 1.0
1 B 6.0 NaN NaN
2 C 5.0 NaN NaN
You can assign a cumcount number grouped by col 1, and pivot the table using that cumcount number.
Solution
df["col 2"] = pd.to_datetime(df["col 2"])
# 1. compute date difference in days using diff() and dt accessor
df["diff"] = df.groupby(["col 1"])["col 2"].diff().dt.days
# 2. assign cumcount for pivoting
df["cumcount"] = df.groupby("col 1").cumcount()
# 3. partial transpose, discarding the first difference, which is NaN
df2 = df[["col 1", "diff", "cumcount"]]\
.pivot(index="col 1", columns="cumcount")\
.drop(columns=[("diff", 0)])
Result
# replace column names for readability
df2.columns = [f"d{i+2}-d{i+1}" for i in range(len(df2.columns))]
print(df2)
d2-d1 d3-d2 d4-d3
col 1
A 2.0 3.0 1.0
B 6.0 NaN NaN
C 5.0 NaN NaN
df after assigning cumcount looks like this:
print(df)
col 1 col 2 diff cumcount
0 A 2020-07-13 NaN 0
1 A 2020-07-15 2.0 1
2 A 2020-07-18 3.0 2
3 A 2020-07-19 1.0 3
4 B 2020-07-13 NaN 0
5 B 2020-07-19 6.0 1
6 C 2020-07-13 NaN 0
7 C 2020-07-18 5.0 1
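For reference, a minimal self-contained setup with the sample frame from the question, against which the snippets above can be run:
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "col 1": ["A", "A", "A", "A", "B", "B", "C", "C"],
    "col 2": ["2020-07-13", "2020-07-15", "2020-07-18", "2020-07-19",
              "2020-07-13", "2020-07-19", "2020-07-13", "2020-07-18"],
})
df["col 2"] = pd.to_datetime(df["col 2"])

# Per-group consecutive date difference in days, as used in the answers above.
df["diff"] = df.groupby("col 1")["col 2"].diff().dt.days
print(df)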

Find Ranges of Null Values in a Column - Pandas

I'm trying to find the ranges of NaN values in a df like this:
[Column_1] [Column_2]
1 A 10
2 B 20
3 C NaN
4 D NaN
5 E NaN
6 F 60
7 G 65
8 H NaN
9 I NaN
10 J NaN
11 K 90
12 L NaN
13 M 100
So, for now what I just did was to list the index of the NaN values with this line:
df[df['Column_2'].isnull()].index.tolist()
But then, I don't know how to set the intervals of these values in terms of Column_1, which for this case would be:
[C-E] [H-J] [L]
Thanks for your insights!
Filter the rows where the values in Column_2 are NaN, then groupby these rows on consecutive occurrence of NaN values in Column_2 and collect the corresponding values of Column_1 inside a list comprehension:
m = df['Column_2'].isna()
r = [[*g['Column_1']] for _, g in df[m].groupby((~m).cumsum())]
print(r)
[['C', 'D', 'E'], ['H', 'I', 'J'], ['L']]
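If you want the bracketed ranges from the question (first and last label of each run), a small follow-up sketch built on the same r could be:
# Format each run of consecutive NaNs as "first-last", or just the single
# label when the run has length one.
labels = [f"{g[0]}-{g[-1]}" if len(g) > 1 else g[0] for g in r]
print(labels)
# ['C-E', 'H-J', 'L']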

Pandas delete and shift cells in a column based on multiple conditions

I have a situation where I want to delete and shift cells in a pandas data frame based on some conditions. My data frame looks like this:
Value_1 ID_1 Value_2 ID_2 Value_3 ID_3
A 1 D 1 G 1
B 1 E 2 H 1
C 1 F 2 I 3
C 1 F 2 H 1
Now I want to apply the following conditions:
ID_2 and ID_3 should always be less than or equal to ID_1. If either of them is greater than ID_1, then that cell should be deleted and the next column's cell shifted into its place.
The output should look like the following:
Value_1 ID_1 Value_2 ID_2 Value_3 ID_3
A 1 D 1 G 1
B 1 H 1 blank nan
C 1 blank nan blank nan
C 1 H 1 blank nan
You can create a mask by condition, here for values greater than ID_1, with DataFrame.gt:
cols1 = ['Value_2','Value_3']
cols2 = ['ID_2','ID_3']
m = df[cols2].gt(df['ID_1'], axis=0)
print (m)
ID_2 ID_3
0 False False
1 True False
2 True True
3 True False
Then set values to missing where the mask matches, with DataFrame.mask:
df[cols2] = df[cols2].mask(m)
df[cols1] = df[cols1].mask(m.to_numpy())
And last, use DataFrame.shift and set the new columns with Series.mask:
df1 = df[cols2].shift(-1, axis=1)
df['ID_2'] = df['ID_2'].mask(m['ID_2'], df1['ID_2'])
df['ID_3'] = df['ID_3'].mask(m['ID_2'])
df2 = df[cols1].shift(-1, axis=1)
df['Value_2'] = df['Value_2'].mask(m['ID_2'], df2['Value_2'])
df['Value_3'] = df['Value_3'].mask(m['ID_2'])
print (df)
Value_1 ID_1 Value_2 ID_2 Value_3 ID_3
0 A 1 D 1.0 G 1.0
1 B 1 H 1.0 NaN NaN
2 C 1 NaN NaN NaN NaN
3 C 1 H 1.0 NaN NaN
And lastly, if necessary, replace NaN with empty strings:
df[cols1] = df[cols1].fillna('')
print (df)
Value_1 ID_1 Value_2 ID_2 Value_3 ID_3
0 A 1 D 1.0 G 1.0
1 B 1 H 1.0 NaN
2 C 1 NaN NaN
3 C 1 H 1.0 NaN
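For reference, a minimal sketch of the sample frame from the question that the steps above can be applied to:
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Value_1": ["A", "B", "C", "C"],
    "ID_1":    [1, 1, 1, 1],
    "Value_2": ["D", "E", "F", "F"],
    "ID_2":    [1, 2, 2, 2],
    "Value_3": ["G", "H", "I", "H"],
    "ID_3":    [1, 1, 3, 1],
})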

How to convert column into row?

Assume I have two rows where the values are the same for most of the columns, but not all. I would like to group these two rows into one wherever the values are the same; where the values differ, an extra column should be created and named, e.g., 'column1'.
Step 1: Here, assuming the columns with the same value in both rows are 'a','b','c' and the columns with different values are 'd','e','f', I group by 'a','b','c' and then unstack 'd','e','f'.
Step 2: Then I drop the levels and rename the columns to 'a','b','c','d','d1','e','e1','f','f1'.
But in my actual case I have 500+ columns and a million rows, and I don't know how to extend this to 500+ columns given constraints like:
1) I don't know which columns will have the same values
2) Nor which columns will have different values that need to be converted into new columns after grouping on the columns with the same values
df.groupby(['a','b','c'])['d','e','f'].apply(lambda x: pd.DataFrame(x.values)).unstack().reset_index()
df.columns = df.columns.droplevel()
df.columns = ['a','b','c','d','d1','e','e1','f','f1']
To be more clear, the below code creates the sample dataframe & expected output
df = pd.DataFrame({'Cust_id': [100, 100, 101, 101, 102, 103, 104, 104],
                   'gender': ['M', 'M', 'F', 'F', 'M', 'F', 'F', 'F'],
                   'Date': ['01/01/2019', '02/01/2019', '01/01/2019', '01/01/2019',
                            '03/01/2019', '04/01/2019', '03/01/2019', '03/01/2019'],
                   'Product': ['a', 'a', 'b', 'c', 'd', 'd', 'e', 'e']})
expected_output = pd.DataFrame({'Cust_id': [100, 101, 102, 103, 104],
                                'gender': ['M', 'F', 'M', 'F', 'F'],
                                'Date': ['01/01/2019', '01/01/2019', '03/01/2019', '04/01/2019', '03/01/2019'],
                                'Date1': ['02/01/2019', 'NA', 'NA', 'NA', 'NA'],
                                'Product': ['a', 'b', 'd', 'd', 'e'],
                                'Product1': ['NA', 'c', 'NA', 'NA', 'NA']})
You may do the following to get expected_output from df:
s = df.groupby('Cust_id').cumcount().astype(str).replace('0', '')
df1 = df.pivot_table(index=['Cust_id', 'gender'], columns=s, values=['Date', 'Product'], aggfunc='first')
df1.columns = df1.columns.map(''.join)
Out[57]:
Date Date1 Product Product1
Cust_id gender
100 M 01/01/2019 02/01/2019 a a
101 F 01/01/2019 01/01/2019 b c
102 M 03/01/2019 NaN d NaN
103 F 04/01/2019 NaN d NaN
104 F 03/01/2019 03/01/2019 e e
Next, replace values duplicated from the previous column with NA:
df_expected = df1.where(df1.ne(df1.shift(axis=1)), 'NA').reset_index()
Out[72]:
Cust_id gender Date Date1 Product Product1
0 100 M 01/01/2019 02/01/2019 a NA
1 101 F 01/01/2019 NA b c
2 102 M 03/01/2019 NA d NA
3 103 F 04/01/2019 NA d NA
4 104 F 03/01/2019 NA e NA
You can try this code - it could be a little cleaner but I think it does the job
df = pd.DataFrame({'a': [100, 100], 'b': ['tue', 'tue'], 'c': ['yes', 'yes'],
                   'd': ['ok', 'not ok'], 'e': ['ok', 'maybe'], 'f': [55, 66]})
df_transformed = pd.DataFrame()
for column in df.columns:
    col_vals = df.groupby(column)['b'].count().index.values
    for ix, col_val in enumerate(col_vals):
        temp_df = pd.DataFrame({column + str(ix): [col_val]})
        df_transformed = pd.concat([df_transformed, temp_df], axis=1)
Output for df_transformed:
    a0   b0   c0      d0  d1     e0  e1  f0  f1
0  100  tue  yes  not ok  ok  maybe  ok  55  66

DataFrame difference between rows based on multiple columns

I am trying to calculate the difference between rows based on multiple columns. The data set is very large, so I am pasting dummy data below that describes the problem.
I want to calculate the daily difference in weight at a pet+name level. So far I have only come up with the solution of concatenating these columns and creating a multiindex based on the new column and the date column, but I think there should be a better way. In the real dataset I have more than 3 columns that I use to calculate the row difference.
df['pet_name'] = df.pet + df.name
df.set_index(['pet_name', 'date'], inplace=True)
df.sort_index(inplace=True)
df['diffs'] = np.nan
for idx in df.index.levels[0]:
    df.diffs[idx] = df.weight[idx].diff()
Based on your description, you can try groupby:
df['pet_name']=df.pet + df.name
df.groupby('pet_name')['weight'].diff()
Use groupby by 2 columns:
df.groupby(['pet', 'name'])['weight'].diff()
All together:
#convert dates to datetimes
df['date'] = pd.to_datetime(df['date'])
#sorting
df = df.sort_values(['pet', 'name','date'])
#get differences per groups
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()
Sample:
np.random.seed(123)
N = 100
L = list('abc')
df = pd.DataFrame({'pet': np.random.choice(L, N),
                   'name': np.random.choice(L, N),
                   'date': pd.Series(pd.date_range('2015-01-01', periods=int(N/10)))
                             .sample(N, replace=True),
                   'weight': np.random.rand(N)})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(['pet', 'name','date'])
df['diffs'] = df.groupby(['pet', 'name', 'date'])['weight'].diff()
df['pet_name'] = df.pet + df.name
df = df.sort_values(['pet_name','date'])
df['diffs1'] = df.groupby(['pet_name', 'date'])['weight'].diff()
print (df.head(20))
date name pet weight diffs pet_name diffs1
1 2015-01-02 a a 0.105446 NaN aa NaN
2 2015-01-03 a a 0.845533 NaN aa NaN
2 2015-01-03 a a 0.980582 0.135049 aa 0.135049
2 2015-01-03 a a 0.443368 -0.537214 aa -0.537214
3 2015-01-04 a a 0.375186 NaN aa NaN
6 2015-01-07 a a 0.715601 NaN aa NaN
7 2015-01-08 a a 0.047340 NaN aa NaN
9 2015-01-10 a a 0.236600 NaN aa NaN
0 2015-01-01 b a 0.777162 NaN ab NaN
2 2015-01-03 b a 0.871683 NaN ab NaN
3 2015-01-04 b a 0.988329 NaN ab NaN
4 2015-01-05 b a 0.918397 NaN ab NaN
4 2015-01-05 b a 0.016119 -0.902279 ab -0.902279
5 2015-01-06 b a 0.095530 NaN ab NaN
5 2015-01-06 b a 0.894978 0.799449 ab 0.799449
5 2015-01-06 b a 0.365719 -0.529259 ab -0.529259
5 2015-01-06 b a 0.887593 0.521874 ab 0.521874
7 2015-01-08 b a 0.792299 NaN ab NaN
7 2015-01-08 b a 0.313669 -0.478630 ab -0.478630
7 2015-01-08 b a 0.281235 -0.032434 ab -0.032434
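For reference, a minimal sketch of the two-column groupby on a tiny deterministic frame (the values here are made up for illustration):
import pandas as pd

# Tiny illustrative frame (hypothetical values).
df = pd.DataFrame({
    "pet":    ["dog", "dog", "dog", "cat"],
    "name":   ["rex", "rex", "rex", "mia"],
    "date":   pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01"]),
    "weight": [10.0, 10.5, 11.0, 4.0],
})

# Sort by date within each pet+name group, then take consecutive differences.
df = df.sort_values(["pet", "name", "date"])
df["diffs"] = df.groupby(["pet", "name"])["weight"].diff()
print(df)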
