Creating a binary vector from the dataframe column values

I have a dataframe df
ID KD DT
0 1 2 5.6
1 1 5 8.7
4 4 9 1.9
5 4 2 1.7
6 4 7 8.8
2 6 9 8.3
3 6 7 7.2
9 7 36 3.1
10 7 2 2.2
12 7 7 5.6
I want to create a dataframe that has one row per unique KD value and one column per ID in a given list, ID = [1,2,4,6,7,8], with entries in {-1,0,1} determined by DT. The new dataframe should have len(ID)+1 columns: the first holds the unique KD value, and the remaining len(ID) columns are filled as follows: column id = 1 if df.loc[(df.ID==id) & (df.KD==kd),'DT'] >= 5, column id = -1 if df.loc[(df.ID==id) & (df.KD==kd),'DT'] < 5, and column id = 0 if the (kd,id) pair is not in df.
For the dataframe given above new dataframe should be
df2
KD 1 2 4 6 7 8
0 2 1 0 -1 0 -1 0
1 5 1 0 0 0 0 0
2 7 0 0 1 1 1 0
3 9 0 0 -1 1 0 0
4 36 0 0 0 0 -1 0
In practice, the numbers of unique KD and ID values are very large (in the range of 10K). Is there a very efficient way to do this?

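Let's try this using pivot and mask:
ID = [1,2,4,6,7,8]
# pass index/columns/values by keyword; pandas 2.0 and later no longer accept them positionally
df_p = df.pivot(index='KD', columns='ID', values='DT')
# both masks test the original df_p, so the 1s written first are not re-mapped to -1
df_p.mask(df_p >= 5, 1).mask(df_p < 5, -1).reindex(ID, axis=1)\
.fillna(0).reset_index()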
Output:
ID KD 1 2 4 6 7 8
0 2 1.0 0.0 -1.0 0.0 -1.0 0.0
1 5 1.0 0.0 0.0 0.0 0.0 0.0
2 7 0.0 0.0 1.0 1.0 1.0 0.0
3 9 0.0 0.0 -1.0 1.0 0.0 0.0
4 36 0.0 0.0 0.0 0.0 -1.0 0.0
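If integer output is wanted, as in the expected df2, one possible variation is to encode each row as 1 or -1 before pivoting and cast after filling. This is only a sketch: the helper column name flag is made up for illustration, and it assumes every (KD, ID) pair occurs at most once, since pivot raises on duplicates.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 4, 4, 4, 6, 6, 7, 7, 7],
    "KD": [2, 5, 9, 2, 7, 9, 7, 36, 2, 7],
    "DT": [5.6, 8.7, 1.9, 1.7, 8.8, 8.3, 7.2, 3.1, 2.2, 5.6],
})
ID = [1, 2, 4, 6, 7, 8]

# Encode DT as +1 (>= 5) or -1 (< 5) per row; "flag" is a hypothetical helper column.
signed = df.assign(flag=np.where(df["DT"] >= 5, 1, -1))

df2 = (
    signed.pivot(index="KD", columns="ID", values="flag")
    .reindex(columns=ID)        # add the IDs that never occur (2 and 8) as all-NaN columns
    .fillna(0)                  # missing (KD, ID) pairs become 0
    .astype(int)                # drop the float dtype introduced by the NaNs
    .rename_axis(columns=None)  # remove the "ID" label from the column axis
    .reset_index()
)
print(df2)
```

For around 10K unique values on each axis the dense result has on the order of 10^8 cells, so memory rather than speed may become the limiting factor.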

Related

Change value of a specific column on dataframe subgroups in pandas based on condition

I have a dataframe similar to the one below:
A B C
1 0 0.0
1 2 0.2
1 3 1.0
2 1 0.2
2 4 0.0
2 6 1.0
3 1 0.4
3 2 1.0
3 0 0.9
3 3 0.0
Now, for each subgroup, where a subgroup will have a shared A value, I want to find the row that has the minimum B value, then change the value of C for that row to 0.5. In this case, I would obtain a new dataframe:
A B C
1 0 0.5
1 2 0.2
1 3 1.0
2 1 0.5
2 4 0.0
2 6 1.0
3 1 0.4
3 2 1.0
3 0 0.5
3 3 0.0
As an addendum, if this operation replaces a 0.0 or 1.0 in the C column, then I'd like for the row to be duplicated with its old value. In this case, the A=1 subgroup infringes this rule (0.0 is replaced with 0.5) and therefore should produce:
A B C
1 0 0.0
1 0 0.5
1 2 0.2
1 3 1.0
...
The first problem is the main one, the second one isn't a priority, but of course, would welcome help with either.
Try:
df.loc[df.groupby('A')['B'].idxmin(), 'C'] = 0.5
Output:
A B C
0 1 0 0.5
1 1 2 0.2
2 1 3 1.0
3 2 1 0.5
4 2 4 0.0
5 2 6 1.0
6 3 1 0.4
7 3 2 1.0
8 3 0 0.5
9 3 3 0.0
For the addendum:
# minimum B rows
min_rows = df.groupby('A')['B'].idxmin()
# minimum B rows whose C is 0.0 or 1.0 (the values the addendum asks to keep)
zeros = df.loc[min_rows].loc[lambda x: x['C'].isin([0, 1])].copy()
# change all min rows to 0.5
df.loc[min_rows, 'C'] = 0.5
# append the preserved original rows
df = pd.concat([df, zeros])
Output (notice the last row):
A B C
0 1 0 0.5
1 1 2 0.2
2 1 3 1.0
3 2 1 0.5
4 2 4 0.0
5 2 6 1.0
6 3 1 0.4
7 3 2 1.0
8 3 0 0.5
9 3 3 0.0
0 1 0 0.0
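To have the preserved row sit directly above its replacement, as in the question's example, one option is to concatenate the old rows first and then sort on the index with a stable sort. This is a sketch on the sample data; the names keep_old, updated and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "B": [0, 2, 3, 1, 4, 6, 1, 2, 0, 3],
    "C": [0.0, 0.2, 1.0, 0.2, 0.0, 1.0, 0.4, 1.0, 0.9, 0.0],
})

# Index label of the minimum-B row in each A group
min_rows = df.groupby("A")["B"].idxmin()

# Copies of the minimum-B rows whose C is about to lose a 0.0 or 1.0
keep_old = df.loc[min_rows].loc[lambda x: x["C"].isin([0.0, 1.0])].copy()

updated = df.copy()
updated.loc[min_rows, "C"] = 0.5

# Old rows go first; a stable sort keeps them ahead of the updated row with the same index
result = (
    pd.concat([keep_old, updated])
    .sort_index(kind="mergesort")
    .reset_index(drop=True)
)
print(result)
```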

How can I calculate a rolling mean only when Marker column is 1

I want to calculate a rolling mean only when a Marker column is 1. This is a small example, but the real-world data is massive, so the solution needs to be efficient.
df = pd.DataFrame()
df['Obs']=[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
df['Marker']=[0,0,0,0,1,0,0,0,0,1,0,0,0,0,1]
df['Mean']=(df.Obs.rolling(5).mean())
How can I create a Desired column like this:
df['Desired']=[0,0,0,0,3.0,0,0,0,0,8.0,0,0,0,0,13.0]
print(df)
Obs Marker Mean Desired
0 1 0 NaN 0.0
1 2 0 NaN 0.0
2 3 0 NaN 0.0
3 4 0 NaN 0.0
4 5 1 3.0 3.0
5 6 0 4.0 0.0
6 7 0 5.0 0.0
7 8 0 6.0 0.0
8 9 0 7.0 0.0
9 10 1 8.0 8.0
10 11 0 9.0 0.0
11 12 0 10.0 0.0
12 13 0 11.0 0.0
13 14 0 12.0 0.0
14 15 1 13.0 13.0
You are close, just need a where:
df['Mean']= df.Obs.rolling(5).mean().where(df['Marker']==1, 0)
Output:
Obs Marker Mean
0 1 0 0.0
1 2 0 0.0
2 3 0 0.0
3 4 0 0.0
4 5 1 3.0
5 6 0 0.0
6 7 0 0.0
7 8 0 0.0
8 9 0 0.0
9 10 1 8.0
10 11 0 0.0
11 12 0 0.0
12 13 0 0.0
13 14 0 0.0
14 15 1 13.0
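The same idea can be written into a separate Desired column so that the original Mean column is left untouched. A minimal self-contained sketch on the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    "Obs": range(1, 16),
    "Marker": [0, 0, 0, 0, 1] * 3,
})

# Rolling mean over the last 5 observations (NaN for the first 4 rows)
rolling_mean = df["Obs"].rolling(5).mean()

# Keep the rolling mean where Marker is 1, put 0 everywhere else
df["Desired"] = rolling_mean.where(df["Marker"].eq(1), 0)
print(df)
```

Note that where only replaces the rows in which the condition is False, so a marker falling inside the first four rows would keep the NaN from the incomplete window; a trailing fillna(0) would cover that case if it matters.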

how to count rows when value change from value greater than threshold to 0

I have three columns in a dataframe: x1, X2 and X3. For each column, I want to count the rows where the value is 0 and the last non-zero value before it was greater than 1. If the value before the zeros is less than 1, those rows do not need to be counted.
input df:
df1 = pd.DataFrame({'x1': [3,4,7,0,0,0,0,20,15,16,0,0,70],
                    'X2': [3,4,7,0,0,0,0,20,15,16,0,0,70],
                    'X3': [6,3,0.5,0,0,0,0,20,15,16,0,0,70]})
print(df1)
x1 X2 X3
0 3 3 6.0
1 4 4 3.0
2 7 7 0.5
3 0 0 0.0
4 0 0 0.0
5 0 0 0.0
6 0 0 0.0
7 20 20 20.0
8 15 15 15.0
9 16 16 16.0
10 0 0 0.0
11 0 0 0.0
12 70 70 70.0
desired_output
x1_count X2_count X3_count
0 6 6 2
The idea is to replace 0 with missing values, forward fill them, set every value that was not originally 0 back to NaN, test for greater than 1, and count the Trues with sum. The resulting Series is then converted to a one-row DataFrame with to_frame and transpose:
m = df1.eq(0)
df2 = (df1.mask(m)
          .ffill()
          .where(m)
          .gt(1)
          .sum()
          .add_suffix('_count')
          .to_frame()
          .T
      )
print (df2)
x1_count X2_count X3_count
0 6 6 2
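To see why X3 ends up with 2 rather than 6, it can help to split the chain into named intermediate steps. The variable names below are made up for illustration; the operations are the same as in the answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    "x1": [3, 4, 7, 0, 0, 0, 0, 20, 15, 16, 0, 0, 70],
    "X2": [3, 4, 7, 0, 0, 0, 0, 20, 15, 16, 0, 0, 70],
    "X3": [6, 3, 0.5, 0, 0, 0, 0, 20, 15, 16, 0, 0, 70],
})

is_zero = df1.eq(0)

# Hide the zeros, then carry the last non-zero value forward into each gap
last_nonzero = df1.mask(is_zero).ffill()

# Keep that carried value only on the rows that were originally zero
at_zero_rows = last_nonzero.where(is_zero)

# A zero row counts only if the value that preceded its run was greater than 1
counts = at_zero_rows.gt(1).sum()
print(counts)
```

In X3 the first run of zeros follows 0.5, so those four rows are skipped, and only the two zeros after 16 are counted.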

Better way to show only first instance in sequence of repeating values in a pandas dataframe?

When the value in my dataframe column A is 1 or -1, I want to store that value in a new column. When the following value is the same as previous (but not zero), I want to set it to zero. My code works exactly as I want, but I want to know if there is a more readable way of doing this?:
import pandas as pd
d = {'A':[0,0,1,1,1,-1,-1,-1,0,-1]}
df = pd.DataFrame(d)
df['match'] = df['A'].loc[~df['A'].eq(df['A'].shift())]
df['match'] = df['match'].fillna(0)
df
Out[1]:
A match
0 0 0.0
1 0 0.0
2 1 1.0
3 1 0.0
4 1 0.0
5 -1 -1.0
6 -1 0.0
7 -1 0.0
8 0 0.0
9 -1 -1.0
We can take advantage of Series.where to also fill in and avoid Series.fillna.
df['match']=df['A'].where(df['A'].ne(df['A'].shift()),0)
print(df)
Output
A match
0 0 0
1 0 0
2 1 1
3 1 0
4 1 0
5 -1 -1
6 -1 0
7 -1 0
8 0 0
9 -1 -1
As stated in the comments, there's nothing wrong with your code right now. But here's another method for your convenience, using Series.where, Series.diff and Series.fillna:
df['match'] = df['A'].where(df['A'].diff().ne(0)).fillna(0)
A match
0 0 0.0
1 0 0.0
2 1 1.0
3 1 0.0
4 1 0.0
5 -1 -1.0
6 -1 0.0
7 -1 0.0
8 0 0.0
9 -1 -1.0
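The two answers give the same values and differ only in the resulting dtype. A small sketch comparing them on the sample data (the names first_of_run, via_where and via_diff are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 0, 1, 1, 1, -1, -1, -1, 0, -1]})

# True on the first row of every run of equal values
first_of_run = df["A"].ne(df["A"].shift())

# where with a fill value: no NaN is ever created, so the integer dtype is kept
via_where = df["A"].where(first_of_run, 0)

# where without a fill value creates NaN first, so fillna returns floats
via_diff = df["A"].where(df["A"].diff().ne(0)).fillna(0)

print(via_where.dtype, via_diff.dtype)
```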

How do I add a counter column that starts and stops at specific rows in a Pandas DataFrame?

I have an existing DataFrame in Pandas that has a column containing 3 different values (Column1). I want to be able to create a column so that it counts each row at every "Start" and stops counting at the next "End" (Column2). What is the best way to do this? I'm not sure how to approach this problem and the output is a strict requirement.
Sample Output:
Column1 Column2
0 0
0 0
0 0
0 0
Start 1
0 2
0 3
0 4
End 5
0 0
0 0
0 0
Start 1
0 2
End 3
mask + ffill
This answer assumes that a Start appears in the DataFrame before an End appears; otherwise the filling will be reversed.
col = df['Column1']
m = col.ne('Start') & col.shift().ne('End')
v = col.eq('Start').mask(m).ffill().fillna(0)
v.groupby(v.ne(v.shift()).cumsum()).cumsum()
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0
5 2.0
6 3.0
7 4.0
8 5.0
9 0.0
10 0.0
11 0.0
12 1.0
13 2.0
14 3.0
Name: Column1, dtype: float64
Explanation
First, find every row that is neither a Start nor the row immediately after an End
>>> m
0 True
1 True
2 True
3 True
4 False
5 True
6 True
7 True
8 True
9 False
10 True
11 True
12 False
13 True
14 True
Name: Column1, dtype: bool
Next, mask those rows and ffill, which means that all values from a Start through its End will be filled with 1
>>> v
0 0.0
1 0.0
2 0.0
3 0.0
4 1.0
5 1.0
6 1.0
7 1.0
8 1.0
9 0.0
10 0.0
11 0.0
12 1.0
13 1.0
14 1.0
Name: Column1, dtype: float64
Finally, group by consecutive repeated values, and cumsum.
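The three steps above can be put together as a self-contained sketch. The sample frame is rebuilt from the question's output, the name run_id is made up for illustration, and the boolean is cast to float before masking so that the dtype stays numeric regardless of pandas version.

```python
import pandas as pd

df = pd.DataFrame({
    "Column1": [0, 0, 0, 0, "Start", 0, 0, 0, "End", 0, 0, 0, "Start", 0, "End"],
})
col = df["Column1"]

# True for rows that are neither a "Start" nor the row right after an "End"
m = col.ne("Start") & col.shift().ne("End")

# 1.0 at each "Start", 0.0 right after each "End", NaN elsewhere; then fill forward
v = col.eq("Start").astype(float).mask(m).ffill().fillna(0)

# Label each run of consecutive equal values, then count up within every run
run_id = v.ne(v.shift()).cumsum()
df["Column2"] = v.groupby(run_id).cumsum().astype(int)
print(df)
```

Runs of 0 sum to 0 and runs of 1 count 1, 2, 3, ..., which is what restarts the counter at every Start.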
