Remove all data in a DF by group based on a condition (pandas,python3) - python-3.x

I have a pandas DF like this:
User Enrolled Time
1 0 12
1 0 1
1 1 2
1 1 3
2 1 3
2 0 4
2 1 1
3 0 2
3 0 3
3 1 4
4 0 1
I want to remove all rows of a users information after they have enrolled. Each users chance to enroll is timed in order. Expected output to look like this:
User Enrolled Time
1 0 12
1 0 1
1 1 2
2 1 3
3 0 2
3 0 3
3 1 4
Hoping someone could help me!
EDIT: Example based on comment for correct answer:
User Enrolled Time
4 0 1
4 0 2
4 0 3
5 0 1

I think what you're looking for is a groupby followed by an apply which does the correct logic for each user. For example:
df = pd.DataFrame([[ 1, 0, 12],
[ 1, 0, 1],
[ 1, 1, 2],
[ 1, 1, 3],
[ 2, 1, 3],
[ 2, 0, 4],
[ 2, 1, 1],
[ 3, 0, 2],
[ 3, 0, 3],
[ 3, 1, 4]],
columns=['User', 'Enrolled', 'Time'])
def filter_enrollment(df):
enrolled = df[df.Enrolled == 1].index.min()
return df[df.index <= enrolled]
result = df.groupby('User').apply(filter_enrollment).reset_index(drop=True)
The result is:
>>> print(result)
User Enrolled Time
0 1 0 12
1 1 0 1
2 1 1 2
3 2 1 3
4 3 0 2
5 3 0 3
6 3 1 4
Here I'm assuming your rows are in order of time. If you want to expliticly filter by the time column instead just change index to Time in the filter function.
Edit: to get the answer of the edited question, you can change the filter function to something like this:
def filter_enrollment(df):
enrolled = df[df.Enrolled == 1].index.min()
if pd.isnull(enrolled):
return df
else:
return df[df.index <= enrolled]

Related

Using Pandas to assign specific values

I have the following dataframe:
data = {'id': [1, 2, 3, 4, 5, 6, 7, 8],
'stat': ['ordered', 'unconfirmed', 'ordered', 'unknwon', 'ordered', 'unconfirmed', 'ordered', 'back'],
'date': ['2021', '2022', '2023', '2024', '2025','2026','2027', '1990']
}
df = pd.DataFrame(data)
df
I am trying to get the following data frame:
Unfortunate I am not successful so far and I used the following commands (for loops) for only stat==ordered:
y0 = np.zeros((len(df), 8), dtype=int)
y1 = [1990]
if stat=='ordered':
for i in df['id']:
for j in y1:
if df.loc[i].at['date'] in y1:
y0[i][y1.index(j)] = 1
else:
y0[i][y1.index(j)] = 0
But unfortunately it did not returned the expected solution and beside that it takes a very long time to do the calculation. I tried to use gruopby, but it could not fgure out either how to use it perporly since it is faster than using for loops. Any idea would be very appreiciated.
IIUC:
df.join(
pd.get_dummies(df.date).cumsum(axis=1).mul(
[1, 2, 1, 3, 1, 2, 1, 0], axis=0
).astype(int)
)
id stat date 1990 2021 2022 2023 2024 2025 2026 2027
0 1 ordered 2021 0 1 1 1 1 1 1 1
1 2 unconfirmed 2022 0 0 2 2 2 2 2 2
2 3 ordered 2023 0 0 0 1 1 1 1 1
3 4 unknwon 2024 0 0 0 0 3 3 3 3
4 5 ordered 2025 0 0 0 0 0 1 1 1
5 6 unconfirmed 2026 0 0 0 0 0 0 2 2
6 7 ordered 2027 0 0 0 0 0 0 0 1
7 8 back 1990 0 0 0 0 0 0 0 0

Count positive, negative or zero values numbers for multiple columns in Python

Given a dataset as follows:
[{'id': 1, 'ltp': 2, 'change': nan},
{'id': 2, 'ltp': 5, 'change': 1.5},
{'id': 3, 'ltp': 3, 'change': -0.4},
{'id': 4, 'ltp': 0, 'change': 2.0},
{'id': 5, 'ltp': 5, 'change': -0.444444},
{'id': 6, 'ltp': 16, 'change': 2.2}]
Or
id ltp change
0 1 2 NaN
1 2 5 1.500000
2 3 3 -0.400000
3 4 0 2.000000
4 5 5 -0.444444
5 6 16 2.200000
I would like to count the number of positive, negative and 0 values for columns ltp and change, the result may like this:
columns positive negative zero
0 ltp 5 0 1
1 change 3 2 0
How could I do that with Pandas or Numpy? Thanks.
Updated: if I need groupby type and count following the logic above
id ltp change type
0 1 2 NaN a
1 2 5 1.500000 a
2 3 3 -0.400000 a
3 4 0 2.000000 b
4 5 5 -0.444444 b
5 6 16 2.200000 b
The expected output:
type columns positive negative zero
0 a ltp 3 0 0
1 a change 1 1 0
2 b ltp 2 0 1
3 b change 2 1 0
Use np.sign with selected columns first, then counts values in value_counts, transpose, replaced missing values and last rename columns names by dictionary with convert index to column columns:
d= {-1:'negative', 1:'positive', 0:'zero'}
df = (np.sign(df[['ltp','change']])
.apply(pd.value_counts)
.T
.fillna(0)
.astype(int)
.rename(columns=d)
.rename_axis('columns')
.reset_index())
print (df)
columns negative zero positive
0 ltp 0 1 5
1 change 2 0 3
EDIT: Another solution with type column with DataFrame.melt, mapping column with np.sign and count values by crosstab:
d= {-1:'negative', 1:'positive', 0:'zero'}
df1 = df.melt(id_vars='type', value_vars=['ltp','change'], var_name='columns')
df1['value'] = np.sign(df1['value']).map(d)
df1 = (pd.crosstab([df1['type'],df1['columns']], df1['value'])
.rename_axis(columns=None)
.reset_index())
print (df1)
type columns negative positive zero
0 a change 1 1 0
1 a ltp 0 3 0
2 b change 1 2 0
3 b ltp 0 2 1

Diagonal Dataframe to 1 row

I need to convert a diagonal Dataframe to 1 row Dataframe.
Input:
df = pd.DataFrame([[7, 0, 0, 0],
[0, 2, 0, 0],
[0, 0, 3, 0],
[0, 0, 0, 8],],
columns=list('ABCD'))
A B C D
0 7 0 0 0
1 0 2 0 0
2 0 0 3 0
3 0 0 0 8
Expected output:
A B C D
0 7 2 3 8
what i tried so far to do this:
df1 = df.sum().to_frame().transpose()
df1
A B C D
0 7 2 3 8
It does the job. But is there any elegant way to do this by groupby or some other pandas builtin?
Not sure if there is any other 'elegant' way, I can only propose alternatives:
Use numpy.diagonal
pd.DataFrame([df.to_numpy().diagonal()], columns=df.columns)
A B C D
0 7 2 3 8
Use groupby with boolean (not sure if this is better than your solution):
df.groupby([True] * len(df), as_index=False).sum()
A B C D
0 7 2 3 8
You can use: np.diagonal(df):
pd.DataFrame(np.diagonal(df), df.columns).T
A B C D
0 7 2 3 8

How can I merge data-frame rows by different columns

I have a DataFrame with 200k rows and some 50 columns with same id in different columns, looking like below:
df = pd.DataFrame({'pic': [1, 0, 0, 0, 2, 0, 3, 0, 0]
, 'story': [0, 1, 0, 2, 0, 0, 0, 0, 3]
, 'des': [0, 0, 1, 0, 0, 2, 0, 3, 0]
, 'some_another_value': [2, 1, 6, 5, 4, 3, 1, 1, 1]
, 'some_value': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
pic story des some_another_value some_value
0 1 0 0 2 nan
1 0 1 0 nan 2
2 0 0 1 nan 3
3 0 2 0 nan 4
4 2 0 0 4 nan
5 0 0 2 nan 6
6 3 0 0 1 nan
7 0 0 3 nan 8
8 0 3 0 nan 9
I would like to merge the rows which have the same value in 'pic' 'story' 'des'
pic story des some_another_value some_value
0 1 1 1 2 5
3 2 2 2 4 10
6 3 3 3 1 17
How can this be achieved?
*I am looking for a solution which not contain a for loop
*Prefer not a sum method
I'm not sure why you say Prefer not a sum method when your expected output data clearly indicate sum. For your sample data, in each row, exactly one of pic, story, des is zero, so:
df.groupby(df[['pic','story', 'des']].sum(1)).sum()
gives
pic story des some_another_value some_value
1 1 1 1 2.0 5.0
2 2 2 2 4.0 10.0
3 3 3 3 1.0 17.0

vectorize groupby pandas

I have a dataframe like this:
day time category count
1 1 a 13
1 2 a 47
1 3 a 1
1 5 a 2
1 6 a 4
2 7 a 14
2 2 a 10
2 1 a 9
2 4 a 2
2 6 a 1
I want to group by day, and category and get a vector of the counts per time. Where time can be between 1 and 10. The max and min of time I have defined in two variables called max and min.
This is how I want the resulting dataframe to look:
day category count
1 a [13,47,1,0,2,4,0,0,0,0]
2 a [9,10,0,2,0,1,14,0,0,0]
Does anyone know how to make this aggregation into a vaector?
Use reindex with MultiIndex.from_product for append missing categories and then groupby with list:
df = df.set_index(['day','time', 'category'])
a = df.index.levels[0]
b = range(1,11)
c = df.index.levels[2]
df = df.reindex(pd.MultiIndex.from_product([a,b,c], names=df.index.names), fill_value=0)
df = df.groupby(['day','category'])['count'].apply(list).reset_index()
print (df)
day category count
0 1 a [13, 47, 1, 0, 2, 4, 0, 0, 0, 0]
1 2 a [9, 10, 0, 2, 0, 1, 14, 0, 0, 0]
EDIT:
df = (df.set_index(['day','time', 'category'])['count']
.unstack(1, fill_value=0)
.reindex(columns=range(1,11), fill_value=0))
print (df)
time 1 2 3 4 5 6 7 8 9 10
day category
1 a 13 47 1 0 2 4 0 0 0 0
2 a 9 10 0 2 0 1 14 0 0 0
df = df.apply(list, 1).reset_index(name='count')
print (df)
day ... count
0 1 ... [13, 47, 1, 0, 2, 4, 0, 0, 0, 0]
1 2 ... [9, 10, 0, 2, 0, 1, 14, 0, 0, 0]
[2 rows x 3 columns]

Resources