How do you maintain in Python the value of the last row in a column, like in Excel? - python-3.x

I have looked around and haven't found an 'elegant' solution. It can't be that it is not doable.
What I need is a column ('col A') on a dataframe that is always 0; when the adjacent column ('col B') hits 1, 'col A' changes to 1 and stays 1 on all further rows (no matter what else happens in 'col B') until another column ('col C') hits 1, at which point 'col A' returns to 0, and the cycle repeats. The data has thousands of rows and gets updated regularly.
Any ideas? I have tried shift, iloc and loops, but can't make it work.
the result should look something like this:
date col A col B col C
... 0 0 0
... 0 0 0
... 1 1 0
... 1 1 0
... 1 0 1
... 0 0 0
... 0 0 0
... 1 1 0
... 1 1 0
... 1 0 0
... 1 0 0
... 1 1 0
... 1 0 0
... 1 1 0
... 1 0 1
... 0 0 0
This is the base code I have been thinking about, but I can't get it to work:
df['B'] = df['A'].apply(lambda x: 1 if x == 1 else 0)
for i in range(1, len(df)):
    if df.loc[i, 'C'] == 1:
        df.loc[i, 'B'] = 0
    else:
        df.loc[i, 'B'] = df.loc[i-1, 'B']
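One way to get the latch behaviour described above is a plain stateful loop (a sketch, with column names shortened to 'A', 'B', 'C' and made-up sample data): B hitting 1 sets the latch, C hitting 1 clears it starting from the next row.

```python
import pandas as pd

# made-up sample mirroring the question: B sets the latch, C clears it
df = pd.DataFrame({
    'B': [0, 0, 1, 1, 0, 0, 0, 1],
    'C': [0, 0, 0, 0, 1, 0, 0, 0],
})

state = 0
out = []
for b, c in zip(df['B'], df['C']):
    if b == 1:
        state = 1          # B hitting 1 sets the latch
    out.append(state)
    if c == 1:
        state = 0          # C hitting 1 clears it from the next row on
df['A'] = out
# df['A'] is now [0, 0, 1, 1, 1, 0, 0, 1]
```

Note that the row where C hits 1 still carries a 1 in 'A', matching the sample table, because the reset only takes effect on the following row.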

Related

Dataframe conversion

I am trying to convert a data frame into a 1,0 matrix format
data = pd.DataFrame({'Val1': ['A', 'B', 'B'],
                     'Val2': ['C', 'A', 'D'],
                     'Val3': ['E', 'F', 'C'],
                     'Comb': ['Comb1', 'Comb2', 'Comb3']})
data:
Val1 Val2 Val3 Comb
0 A C E Comb1
1 B A F Comb2
2 B D C Comb3
What I need is to convert it to the data frame below:
Comb A C D E B F
0 Comb1 1 1 0 1 0 0
1 Comb2 1 0 0 0 1 1
2 Comb3 0 1 1 0 1 0
I was able to do it with a FOR loop but as my dataframe increases, the processing time increases. Is there a better way to do it?
header = set(data[['Val1', 'Val2', 'Val3']].values.ravel())
matrix = pd.DataFrame(columns=header)
for i in range(data.shape[0]):
    temp_dict = {data["Val1"].iloc[i]: 1, data["Val2"].iloc[i]: 1, data["Val3"].iloc[i]: 1}
    matrix = matrix.append(temp_dict, ignore_index=True)
matrix = matrix.loc[:, matrix.columns.notnull()]
matrix = matrix.fillna(0)
matrix = pd.merge(data[["Comb"]], matrix, left_index=True, right_index=True, how='outer')
Thanks!
There may be a better solution, but this is what came to my mind: convert each row to a dictionary of "present" letters, build a Series from the dictionary, and combine the Series into a dataframe.
data.loc[:, 'Val1':'Val3'].apply(lambda row:
        pd.Series({letter: 1 for letter in row}), axis=1)\
    .fillna(0).astype(int).join(data.Comb)
# A B C D E F Comb
#0 1 0 1 0 1 0 Comb1
#1 1 1 0 0 0 1 Comb2
#2 0 1 1 1 0 0 Comb3
There are probably multiple ways to solve this; I used pd.crosstab for it:
import pandas as pd
data = pd.DataFrame({'Val1': ['A', 'B', 'B'],
                     'Val2': ['C', 'A', 'D'],
                     'Val3': ['E', 'F', 'C'],
                     'Comb': ['Comb1', 'Comb2', 'Comb3']})
data["lst"] = data[['Val1', 'Val2', 'Val3']].values.tolist()
data = data.explode("lst")
print(pd.crosstab(data["Comb"], data["lst"]))
Out[20]:
lst A B C D E F
Comb
Comb1 1 0 1 0 1 0
Comb2 1 1 0 0 0 1
Comb3 0 1 1 1 0 0
I guess this will work. Please let me know if it works
pd.get_dummies(data, columns=['Val1','Val2','Val3'],prefix="",prefix_sep="").groupby(axis=1,level=0).sum()
Here's another way:
data.melt('Comb').set_index('Comb')['value'].str.get_dummies().sum(level=0).reset_index()
Output:
Comb A B C D E F
0 Comb1 1 0 1 0 1 0
1 Comb2 1 1 0 0 0 1
2 Comb3 0 1 1 1 0 0

How to identify where a particular sequence in a row occurs for the first time

I have a dataframe in pandas, an example of which is provided below:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6
A 1 0 0 1 0 0
B 1 1 0 0 1 0
C 1 0 1 1 0 0
D 0 0 1 0 0 1
E 1 1 1 1 1 1
As you can see, 1 and 0 occur randomly across the columns. I would like Python code that finds the column number where the pattern 1, 0, 0 occurs for the first time in each row. For example, for member A the first 1, 0, 0 pattern starts at appear_1, so the first occurrence is 1. Similarly, for member B the first 1, 0, 0 pattern starts at appear_2, so the first occurrence is 2. The resulting table should have a new column named 'first_occurrence'. If no 1, 0, 0 pattern occurs (as in row E), the value in that column should be the number of 1s in the row. The resulting table should look something like this:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 first_occurrence
A 1 0 0 1 0 0 1
B 1 1 0 0 1 0 2
C 1 0 1 1 0 0 4
D 0 0 1 0 0 1 3
E 1 1 1 1 1 1 6
Thank you in advance.
I try not to reinvent the wheel, so I build on my answer to the previous question. Compared to that answer, you additionally need idxmax, np.where, and get_indexer:
import numpy as np

cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df[cols]
m = df1[df1.eq(1)].ffill(1).notna()
df2 = df1[m].bfill(1).eq(0)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)
df['first_occurrence'] = np.where(m2.any(1), df1.columns.get_indexer(m2.idxmax(1)),
                                  df1.shape[1])
Out[540]:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 first_occurrence
0 A 1 0 0 1 0 0 1
1 B 1 1 0 0 1 0 2
2 C 1 0 1 1 0 0 4
3 D 0 0 1 0 0 1 3
4 E 1 1 1 1 1 1 6
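If the chained masks above are hard to follow, the same result can be checked with a straightforward per-row scan (a hypothetical helper, not the answer's method): find the first window equal to 1, 0, 0 and report its 1-based start position, falling back to the row's count of 1s.

```python
import pandas as pd

# the sample data from the question
df = pd.DataFrame({
    'Person':   ['A', 'B', 'C', 'D', 'E'],
    'appear_1': [1, 1, 1, 0, 1],
    'appear_2': [0, 1, 0, 0, 1],
    'appear_3': [0, 0, 1, 1, 1],
    'appear_4': [1, 0, 1, 0, 1],
    'appear_5': [0, 1, 0, 0, 1],
    'appear_6': [0, 0, 0, 1, 1],
})
cols = [c for c in df.columns if c.startswith('appear_')]

def first_100(row, pattern=(1, 0, 0)):
    p = len(pattern)
    for i in range(len(row) - p + 1):
        if tuple(row[i:i + p]) == pattern:
            return i + 1              # 1-based column position
    return int(sum(row))              # no pattern: number of 1s in the row

df['first_occurrence'] = [first_100(list(r)) for r in df[cols].to_numpy()]
# df['first_occurrence'] is [1, 2, 4, 3, 6]
```

This is slower than the vectorized answer on large frames, but it makes the fallback rule (sum of 1s when no pattern exists) explicit.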

How to identify a sequence and index number before a particular sequence occurs for the first time

I have a dataframe in pandas, an example of which is provided below:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6
A 1 0 0 1 0 1
B 1 1 0 0 1 0
C 1 0 1 1 0 0
D 1 1 0 1 0 0
As you can see, 1 and 0 occur randomly across the columns. I would like Python code that counts the number of 1s that occur before the first occurrence of the sequence 1, 0, 0. For example, for member A the first double-zero event occurs at appear_2 and appear_3, so the duration is 1. Similarly, for member B the first double-zero event occurs at appear_3 and appear_4, so a total of two 1s occur before it. The 1 that starts the 1, 0, 0 sequence is included in the count, because that 1 indicates that a person started the process and the 0, 0 indicates his/her absence for two consecutive appearances after initiating it. The resulting table should have a new column 'duration', something like this:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 duration
A 1 0 0 1 0 1 1
B 1 1 0 0 1 0 2
C 1 0 1 1 0 0 3
D 1 1 1 1 0 0 4
Thank you in advance.
A little logic here: first use a rolling sum to find where two consecutive values sum to 0, then take a cumulative product along the row. Once the product hits a 0 it stays 0, so summing the values where the product is still non-zero gives the result.
s=df.iloc[:,1:]
s1=s.rolling(2,axis=1,min_periods=1).sum().cumprod(axis=1)
s.mask(s1==0).sum(1)
Out[37]:
0 1.0
1 2.0
2 3.0
3 4.0
dtype: float64
My logic is to compare each position with the next one: where both are 0, the mask turns True at that location. After that, do cumsum on axis=1; locations in front of the first True stay 0 after the cumsum. Finally, compare the mask to 0 to keep only positions appearing before the double zero, and sum. To use this logic, I need to handle the case where the double zero comes first in the row, as in 'D': 0, 0, 1, 1, 0, 0. Your sample doesn't have this case, but I expect the real data would.
cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df[cols]
m = df1[df1.eq(1)].ffill(1).notna()
df2 = df1[m].bfill(1).eq(0)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)
df['duration'] = df1[m2.cumsum(1) == 0].sum(1)
Out[100]:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 duration
0 A 1 0 0 1 0 1 1.0
1 B 1 1 0 0 1 0 2.0
2 C 1 0 1 1 0 0 3.0
3 D 1 1 1 1 0 0 4.0
I changed the sample to include the special case where the first elements in a row are 0.
Update: added case E, where all appear_x are 1.
Sample (df_n):
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6
0 A 1 0 0 1 0 1
1 B 1 1 0 0 1 0
2 C 1 0 1 1 0 0
3 D 0 0 1 1 0 0
4 E 1 1 1 1 1 1
cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df_n[cols]
m = df1[df1.eq(1)].ffill(1).notna()
df2 = df1[m].bfill(1).eq(0)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)
df_n['duration'] = df1[m2.cumsum(1) == 0].sum(1)
Out[503]:
Person appear_1 appear_2 appear_3 appear_4 appear_5 appear_6 duration
0 A 1 0 0 1 0 1 1.0
1 B 1 1 0 0 1 0 2.0
2 C 1 0 1 1 0 0 3.0
3 D 0 0 1 1 0 0 2.0
4 E 1 1 1 1 1 1 6.0
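The masking trick above can also be cross-checked with a plain per-row loop (a hypothetical sketch, not the answer's method): walk the row, start counting 1s at the first 1, and stop as soon as a 0, 0 pair appears after the process has started.

```python
import pandas as pd

# the updated sample (df_n) from the answer, including cases D and E
df_n = pd.DataFrame({
    'Person':   ['A', 'B', 'C', 'D', 'E'],
    'appear_1': [1, 1, 1, 0, 1],
    'appear_2': [0, 1, 0, 0, 1],
    'appear_3': [0, 0, 1, 1, 1],
    'appear_4': [1, 0, 1, 1, 1],
    'appear_5': [0, 1, 0, 0, 1],
    'appear_6': [1, 0, 0, 0, 1],
})
cols = [c for c in df_n.columns if c.startswith('appear_')]

def duration(row):
    started = False
    count = 0
    for i, v in enumerate(row):
        if v == 1:
            started = True
        # stop at the first 0, 0 pair after the process has started
        if started and v == 0 and i + 1 < len(row) and row[i + 1] == 0:
            break
        count += v
    return count

df_n['duration'] = [duration(list(r)) for r in df_n[cols].to_numpy()]
# df_n['duration'] is [1, 2, 3, 2, 6]
```

Requiring `started` before breaking is what handles row D, where the leading zeros come before the first 1 and must not end the count.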

pandas if else only on specific rows

I have a pandas dataframe as below. I want to apply below condition
Only for rows where A == 2, update columns 'C' and 'D' to -99.
I have a function like below which updates the value of C and D to -99.
def func(df):
    for col in df.columns:
        if ("C" in col) or ("D" in col):
            df.loc[:, col] = -99
Now I just want to call that function if A == 2. I tried the code below, but it updates all rows of C and D to -99:
import pandas as pd

data = [[0, 1, 0, 0, 0],
        [1, 2, 0, 0, 0],
        [2, 0, 0, 0, 0],
        [2, 4, 0, 0, 0],
        [1, 8, 0, 0, 0],
        [3, 2, 0, 0, 0]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])
df

def func(df):
    for col in df.columns:
        if ("C" in col) or ("D" in col):
            df.loc[:, col] = -99

if (df['A'] == 2).any():
    func(df)
print(df)
My expected output:
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 -99 -99 0
3 2 4 -99 -99 0
4 1 8 0 0 0
5 3 2 0 0 0
You can do that by filtering:
df.loc[df['A'] == 2, ['C', 'D']] = -99
Here the first part of the indexer selects the rows: we keep only rows where the value in column 'A' is 2. The second part selects the columns by a list of names (C and D). We then assign -99 to that selection.
For the given sample data, we obtain:
>>> df = pd.DataFrame(data,columns=['A','B','C', 'D','E'])
>>> df
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 0 0 0
3 2 4 0 0 0
4 1 8 0 0 0
5 3 2 0 0 0
>>> df.loc[df['A'] == 2, ['C', 'D']] = -99
>>> df
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 -99 -99 0
3 2 4 -99 -99 0
4 1 8 0 0 0
5 3 2 0 0 0

List column name having value greater than zero

I have following dataframe
A | B | C | D
1 0 2 1
0 1 1 0
0 0 0 1
I want to add a new column that lists, for each row, every column whose value is greater than zero, as column-name/value pairs:
A | B | C | D | New
1 0 2 1 A-1, C-2, D-1
0 1 1 0 B-1, C-1
0 0 0 1 D-1
We can use mask and stack
s = df.mask(df == 0).stack().\
      astype(int).astype(str).\
      reset_index(level=1).apply('-'.join, 1).add(',').sum(level=0).str[:-1]
df['New']=s
df
Out[170]:
A B C D New
0 1 0 2 1 A-1,C-2,D-1
1 0 1 1 0 B-1,C-1
2 0 0 0 1 D-1
Combine the column names with the df values that are not zero and then filter out the None values.
import numpy as np

df = pd.read_clipboard()
arrays = np.where(df != 0, df.columns.values + '-' + df.values.astype('str'), None)
new = []
for array in arrays:
    new.append(list(filter(None, array)))
df['New'] = new
df
Out[1]:
A B C D New
0 1 0 2 1 [A-1, C-2, D-1]
1 0 1 1 0 [B-1, C-1]
2 0 0 0 1 [D-1]
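For comparison, a compact alternative (a sketch, not one of the answers above) builds the same labels with a plain list comprehension over the rows:

```python
import pandas as pd

# the sample data from the question
df = pd.DataFrame({'A': [1, 0, 0],
                   'B': [0, 1, 0],
                   'C': [2, 1, 0],
                   'D': [1, 0, 1]})

# join "col-value" pairs for every non-zero cell in each row
df['New'] = [', '.join(f'{c}-{v}' for c, v in row.items() if v != 0)
             for _, row in df.iterrows()]
# df['New'] is ['A-1, C-2, D-1', 'B-1, C-1', 'D-1']
```

Iterating rows is not the fastest option on large frames, but it keeps the label-building rule easy to read and tweak.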