How to check when a column value changes from 0 to 1, and count after how many rows a second column does the same - python-3.x

I have a dataframe with columns x and y. I want to detect when x changes from 0 to 1, and count after how many rows y also changes from 0 to 1.
Here is my dataframe:
df1=pd.DataFrame({'x':[0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,1,1,1,0,0,1,1,1,1],'y':[0,0,0,0,0,0,1,1,1,0,0,0,0,0,0,1,1,1,0,0,1,1,1,1]})
Desired output:
df_out=pd.DataFrame({'count_delay':[1,3,0]})

You can try with diff:
id1 = df1.index[df1.x.diff().eq(1)]  # positions where x flips 0 -> 1
id2 = df1.index[df1.y.diff().eq(1)]  # positions where y flips 0 -> 1
id2 - id1                            # row delay between matching flips
Int64Index([1, 3, 0], dtype='int64')
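To package that as the desired frame, a minimal sketch (this assumes, as in the sample data, that every x transition is followed by exactly one y transition):
df_out = pd.DataFrame({'count_delay': id2 - id1})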
For a groupby approach:
df1.groupby(df1.x.diff().eq(1).cumsum()).y.apply(lambda x : x.index[x.diff().eq(1)]-x.index.min())
x
0 Int64Index([], dtype='int64')
1 Int64Index([1], dtype='int64')
2 Int64Index([3], dtype='int64')
3 Int64Index([], dtype='int64')
Name: y, dtype: object

Related

Get all columns per id where the column is equal to a value

Say I have a pandas dataframe:
id A B C D
id1 0 1 0 1
id2 1 0 0 1
id3 0 0 0 1
id4 0 0 0 0
I want to select, for each id, all the column names where the value is equal to 1; this list will then be a new column in the dataframe.
Expected output:
id A B C D Result
id1 0 1 0 1 [B,D]
id2 1 0 0 1 [A,D]
id3 0 0 0 1 [D]
id4 0 0 0 0 []
I tried df.apply(lambda row: row[row == 1].index, axis=1) but the output of 'Result' was not in the form specified above.
You can do what you are trying to do by adding .tolist():
df['Result'] = df.apply(lambda row: row[row == 1].index.tolist(), axis=1)
That said, your approach of using lists as values inside a single column runs against the Pandas model of keeping data tabular (one value per cell). Plain nested lists, or a long-format frame, will probably serve you better here.
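If you do want one value per cell, here is a minimal sketch of the long-format alternative, using the question's data with the ids as the index:
import pandas as pd

df = pd.DataFrame([[0, 1, 0, 1], [1, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 0]],
                  index=['id1', 'id2', 'id3', 'id4'], columns=list('ABCD'))

long = df.stack()                        # one row per (id, column) pair
print(long[long == 1].index.tolist())
# [('id1', 'B'), ('id1', 'D'), ('id2', 'A'), ('id2', 'D'), ('id3', 'D')]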
Setup
I used a different set of ones and zeros to highlight skipping an entire row.
df = pd.DataFrame(
    [[0, 1, 0, 1], [1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 1, 0]],
    ['id1', 'id2', 'id3', 'id4'],
    ['A', 'B', 'C', 'D']
)
df
A B C D
id1 0 1 0 1
id2 1 0 0 1
id3 0 0 0 0
id4 0 0 1 0
Not Your Grandpa's Reverse Binarizer
n = len(df)
i, j = np.nonzero(df.to_numpy())                        # row and column positions of the ones
col_names = df.columns[j]                               # column label of each non-zero element
positions = np.bincount(i, minlength=n).cumsum()[:-1]   # split points, one bin per row (minlength covers all-zero trailing rows)
result = np.split(col_names, positions)
df.assign(Result=[a.tolist() for a in result])
A B C D Result
id
id1 0 1 0 1 [B, D]
id2 1 0 0 1 [A, D]
id3 0 0 0 0 []
id4 0 0 1 0 [C]
Explanations
Ohh, the details!
np.nonzero on a 2-D array will return two arrays of equal length. The first array will have the 1st dimensional position of each element that is not zero. The second array will have the 2nd dimensional position of each element that is not zero. I'll call the first array i and the second array j.
In the figure below, I label the columns with what j they represent and correspondingly, I label the rows with what i they represent.
For each non-zero element of the dataframe, I place above the value a tuple with the (ith, jth) dimensional positions and in brackets the [kth] non-zero element in the dataframe.
# j → 0 1 2 3
# A B C D
# i
# ↓ (0,1)[0] (0,3)[1]
# 0 id1 0 1 0 1
#
# (1,0)[2] (1,3)[3]
# 1 id2 1 0 0 1
#
#
# 2 id3 0 0 0 0
#
# (3,2)[4]
# 3 id4 0 0 1 0
In the figure below, I show what i and j look like, labelling each row with the same k shown in brackets in the figure above.
# i j
# ----
# 0 1 k=0
# 0 3 k=1
# 1 0 k=2
# 1 3 k=3
# 3 2 k=4
Now it's easy to slice the df.columns with the j array to get all the column labels in one big array.
# df.columns[j]
# B
# D
# A
# D
# C
The plan is to use np.split to chop df.columns[j] into sub-arrays, one per row. It turns out the information we need is embedded in the array i. I'll use np.bincount to count how many non-zero elements are in each row, telling np.bincount the minimum number of bins to assume. That minimum is the number of rows in the dataframe, which we assigned to n with n = len(df).
# np.bincount(i, minlength=n)
# 2 ← Two non-zero elements in the first row
# 2 ← Two more in the second row
# 0 ← None in this row
# 1 ← And one more in the fourth row
Then, if we take the cumulative sum of this array, we get the positions we need to split at.
# np.bincount(i, minlength=n).cumsum()
# 2
# 4
# 4 ← This repeated value results in an empty array for the 3rd row
# 5
Let's look at how this matches up with df.columns[j]. We see below that the column slice gets split exactly where we need.
# B D A D C ← df.columns[j]
# 2 4 4 5   ← np.bincount(i, minlength=n).cumsum() (split after the 2nd, 4th, 4th, and 5th labels)
One issue is that the 4 values in this array will split the df.columns[j] array into 5 sub-arrays. This isn't horrible, because the last array will always be empty, so we slice the positions to the appropriate size: np.bincount(i, minlength=n).cumsum()[:-1]
col_names = df.columns[j]
positions = np.bincount(i, minlength=n).cumsum()[:-1]
result = np.split(col_names, positions)
# result
# [B, D]
# [A, D]
# []
# [C]
The only thing left to do is assign it to a new column, converting the individual sub-arrays to lists.
df.assign(Result=[a.tolist() for a in result])
# A B C D Result
# id
# id1 0 1 0 1 [B, D]
# id2 1 0 0 1 [A, D]
# id3 0 0 0 0 []
# id4 0 0 1 0 [C]
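For reference, a tiny standalone check of the bincount/split mechanics described above, using the i and the column labels from the figures:
import numpy as np

i = np.array([0, 0, 1, 1, 3])                            # row positions of the ones
labels = np.array(['B', 'D', 'A', 'D', 'C'])             # df.columns[j]
positions = np.bincount(i, minlength=4).cumsum()[:-1]    # array([2, 4, 4])
print([a.tolist() for a in np.split(labels, positions)])
# [['B', 'D'], ['A', 'D'], [], ['C']]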

Sorting one column by absolute value, ignoring zeros, while keeping rows with equal values of another column together

I have the following dataframe:
A B C
============
11 x 2
11 y 0
13 x -10
13 y 0
10 x 7
10 y 0
and I would like to sort C by absolute value, ignoring the zeros. But as I need to keep rows with the same A value together, it should look like below (sorted by absolute value, but with the 0 rows kept in between):
A B C
============
13 x -10
13 y 0
10 x 7
10 y 0
11 x 2
11 y 0
I can't manage to obtain this with sort_values(). If I sort by C alone, the A values are no longer together.
Step 1: get absolute values
# creating a column with the absolute values
df["abs_c"] = df["c"].abs()
Step 2: sort values on absolute values of "c"
# sorting by absolute value of "c" & resetting the index & assigning it back to df
df = df.sort_values("abs_c",ascending=False).reset_index(drop=True)
Step 3: get the order of column "a" based on the sorted values. This uses drop_duplicates, which keeps the first occurrence of each value in column "a" (now ordered by the absolute value of "c"). This order is used in the next step.
# getting the order of "a" based on sorted value of "c"
order_a = df["a"].drop_duplicates()
Step 4: based on the order of "a", build a data frame by concatenating the rows for each value of "a" in that order. Note that DataFrame.append is deprecated, and order_a keeps its original non-sequential index (so order_a[i] would raise a KeyError); pd.concat with a list comprehension avoids both problems:
# stitching the groups back together in the order given by order_a
sorted_df = pd.concat([df[df["a"] == a] for a in order_a])
Step 5: assign the sorted df back to df
# reset index of sorted values and assigning it back to df
df = sorted_df.reset_index(drop=True)
Output
a b c abs_c
0 13 x -10 10
1 13 y 0 0
2 10 x 7 7
3 10 y 0 0
4 11 x 2 2
5 11 y 0 0
Doc reference
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html
Sorry, it doesn't turn out very nice, but I almost never use pandas. I hope it works out the way you want.
import pandas as pd
df = pd.DataFrame({'a': [11, 11, 13, 13, 10, 10],
                   'b': ['x', 'y', 'x', 'y', 'x', 'y'],
                   'c': [2, 0, -10, 0, 7, 0]})
mask = df[df['c'] != 0].copy()   # non-zero rows only; copy avoids a SettingWithCopyWarning
mask['abs'] = mask['c'].abs()
mask = mask.sort_values('abs', ascending=False).reset_index(drop=True)
tempNr = 0
for index, row in df.iterrows():
    if row['c'] != 0:
        # overwrite each non-zero row with the next row from the sorted mask
        df.loc[index] = mask.loc[tempNr].drop('abs')
        tempNr = tempNr + 1
print(df)
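For comparison, a more concise sketch using a per-group sort key: rank each group of "a" by its largest absolute value of "c", then do a stable sort so rows within a group keep their original order.
import pandas as pd

df = pd.DataFrame({'a': [11, 11, 13, 13, 10, 10],
                   'b': ['x', 'y', 'x', 'y', 'x', 'y'],
                   'c': [2, 0, -10, 0, 7, 0]})

key = df.groupby('a')['c'].transform(lambda s: s.abs().max())
print(df.loc[key.sort_values(ascending=False, kind='stable').index])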

How to perform cumulative sum inside iterrows

I have a pandas dataframe as below:
df2 = pd.DataFrame({ 'b' : [1, 1, 1]})
df2
b
0 1
1 1
2 1
I want to create a column 'cumsum' with the cumulative sum of column b, starting at row 2. Also, I want to use iterrows to perform this. I tried the code below but it does not seem to work.
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b'].cumsum()
My expected output:
b cum_sum
0 1 NaN
1 1 2
2 1 3
Per your requirement, you may try this:
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[:row_index, 'b'].sum()
Out[10]:
b cumsum
0 1 NaN
1 1 2.0
2 1 3.0
To stick to iterrows():
i = 0
df2['cumsum'] = 0
col = list(df2.columns).index('cumsum')
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b'] + df2.iloc[i, col]
    i += 1
Outputs:
b cumsum
0 1 0
1 1 1
2 1 2
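For comparison, the loop-free equivalent of the first answer is a plain cumsum with the first value masked out (a minimal sketch):
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'b': [1, 1, 1]})
df2['cumsum'] = df2['b'].cumsum()   # 1, 2, 3
df2.loc[0, 'cumsum'] = np.nan       # blank the first row, per the expected output
print(df2)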

Collapse values from multiple rows of a column into an array when all other columns values are same

I have a table with 7 columns where, for every few rows, 6 columns remain the same and only the 7th changes. I would like to merge all these rows into one row, and combine the values of the 7th column into a list.
So if I have this dataframe:
A B C
0 a 1 2
1 b 3 4
2 c 5 6
3 c 7 6
I would like to convert it to this:
A B C
0 a 1 2
1 b 3 4
2 c [5, 7] 6
Since the values of column A and C were same in row 2 and 3, they would get collapsed into a single row and the values of B will be combined into a list.
Melt, explode, and pivot don't seem to have such functionality. How can I achieve this using Pandas?
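For reference, the sample frame above can be built with:
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'c', 'c'],
                   'B': [1, 3, 5, 7],
                   'C': [2, 4, 6, 6]})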
Use GroupBy.agg with a custom lambda function, then add DataFrame.reindex to restore the original column order:
f = lambda x: x.tolist() if len(x) > 1 else x
df = df.groupby(['A','C'])['B'].agg(f).reset_index().reindex(df.columns, axis=1)
You can also build the column names dynamically:
changes = ['B']
cols = df.columns.difference(changes).tolist()
f = lambda x: x.tolist() if len(x) > 1 else x
df = df.groupby(cols)[changes].agg(f).reset_index().reindex(df.columns, axis=1)
print (df)
A B C
0 a 1 2
1 b 3 4
2 c [5, 7] 6
If you want lists in every row, the solution is simpler:
changes = ['B']
cols = df.columns.difference(changes).tolist()
df = df.groupby(cols)[changes].agg(list).reset_index().reindex(df.columns, axis=1)
print (df)
A B C
0 a [1] 2
1 b [3] 4
2 c [5, 7] 6
Here is another approach using pivot_table and applymap:
(df.pivot_table(index='A',aggfunc=list).applymap(lambda x: x[0] if len(set(x))==1 else x)
.reset_index())
A B C
0 a 1 2
1 b 3 4
2 c [5, 7] 6
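As a side note, DataFrame.explode inverts this collapse: scalars pass through unchanged, and the list in B is expanded back to one row per value (a quick sketch):
df.explode('B')
#    A  B  C
# 0  a  1  2
# 1  b  3  4
# 2  c  5  6
# 2  c  7  6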

Replace values in specified list of columns based on a condition

The actual use case is that I want to replace all of the values in some named columns with zero whenever they are less than zero, but leave other columns alone. Let's say in the dataframe below, I want to floor all of the values in columns a and b to zero, but leave column d alone.
df = pd.DataFrame({'a': [0, -1, 2], 'b': [-3, 2, 1],
                   'c': ['foo', 'goo', 'bar'], 'd': [1, -2, 1]})
df
a b c d
0 0 -3 foo 1
1 -1 2 goo -2
2 2 1 bar 1
The second paragraph in the accepted answer to this question: How to replace negative numbers in Pandas Data Frame by zero does provide a workaround: I can set the datatype of column d to be non-numeric, and then change it back again afterwards:
df['d'] = df['d'].astype(object)
num = df._get_numeric_data()
num[num <0] = 0
df['d'] = df['d'].astype('int64')
df
a b c d
0 0 0 foo 1
1 0 2 goo -2
2 2 1 bar 1
but this seems really messy, and it means I need to know the list of columns I don't want to change, rather than the list I do want to change.
Is there a way to just specify the column names directly?
You can use mask with column filtering:
df[['a','b']] = df[['a','b']].mask(df[['a','b']] < 0, 0)   # compare only the selected columns, since df < 0 can fail on the string column c
df
Output
a b c d
0 0 0 foo 1
1 0 2 goo -2
2 2 1 bar 1
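Since the goal here is flooring at zero, clip is an equally direct alternative (a minimal sketch):
cols_to_change = ['a', 'b']
df[cols_to_change] = df[cols_to_change].clip(lower=0)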
Using np.where:
cols_to_change = ['a', 'b', 'd']
df.loc[:, cols_to_change] = np.where(df[cols_to_change]<0, 0, df[cols_to_change])
a b c d
0 0 0 foo 1
1 0 2 goo 0
2 2 1 bar 1
