I have a dataframe of time series data, in which data reporting starts at different times (columns) for different observation units (rows). Prior to first reported datapoint for each unit, the dataframe contains NaN values, e.g.
0 1 2 3 4 ...
A NaN NaN 4 5 6 ...
B NaN 7 8 NaN 10...
C NaN 2 11 24 17...
I want to replace the leading (left-side) NaN values with 0, but only the leading ones (i.e. leaving the internal missing ones as NaN. So the result on the example above would be:
0 1 2 3 4 ...
A 0 0 4 5 6 ...
B 0 7 8 NaN 10...
C 0 2 11 24 17...
(Note the retained NaN for row B col 3)
I could iterate through the dataframe row-by-row, identify the first index of a non-NaN value in each row, and replace everything left of that with 0. But is there a way to do this as a whole-array operation?
notna + cumsum by rows, cells with zeros are leading NaN:
df[df.notna().cumsum(1) == 0] = 0
df
0 1 2 3 4
A 0.0 0.0 4 5.0 6
B 0.0 7.0 8 NaN 10
C 0.0 2.0 11 24.0 17
Here is another way using cumprod() and apply()
s = df.isna().cumprod(axis=1).sum(axis=1)
df.apply(lambda x: x.fillna(0,limit = s.loc[x.name]),axis=1)
Output:
0 1 2 3 4
A 0.0 0.0 4.0 5.0 6.0
B 0.0 7.0 8.0 NaN 10.0
C 0.0 2.0 11.0 24.0 17.0
This question already has answers here:
Python: Justifying NumPy array
(2 answers)
How to move Nan values to end in all columns
(2 answers)
Closed 1 year ago.
I have the following DF:
AA BB CC
1 1 1
NaN 3 NaN
4 4 6
NaN NaN 3
NaN
NaN
4
The output should be:
AA BB CC
1 1 1
4 3 6
4 3
4
I've tried:
df = df.dropna(subset=['AA', 'BB', 'CC'])
AA BB CC
0 2 3 1
2 5 5 6
and this is the output I get.
Is there anything else I should be doing differently?
You can use:
df.apply(lambda x: x.dropna().reset_index(drop = True))
AA BB CC
0 1.0 1.0 1.0
1 4.0 3.0 6.0
2 NaN 4.0 3.0
3 NaN NaN 4.0
import pandas as pd
data={'col1':[1,3,3,1,2,3,2,2]}
df=pd.DataFrame(data,columns=['col1'])
print df
col1
0 1
1 3
2 3
3 1
4 2
5 3
6 2
7 2
Expected result:
Col1 newCol1
0 1. 1
1 3. 3
2 3. NaN
3. 1. 1
4 2. 2
5 3. 3
6 2. 2
7. 2. Nan
Try where combine with shift
df['col2'] = df.col1.where(df.col1.ne(df.col1.shift()))
df
Out[191]:
col1 col2
0 1 1.0
1 3 3.0
2 3 NaN
3 1 1.0
4 2 2.0
5 3 3.0
6 2 2.0
7 2 NaN
I have a dataframe like below. I would like to sum row 0 to 4 (every 5 rows) and create another column with summed value ("new column"). My real dataframe has 263 rows so, last three rows every 12 rows will be sum of three rows only. How I can do this using Pandas/Python. I have started to learn Python recently. Thanks for any advice in advance!
My data patterns is more complex as I am using the index as one of my column values and it repeats like:
Row Data "new column"
0 5
1 1
2 3
3 3
4 2 14
5 4
6 8
7 1
8 2
9 1 16
10 0
11 2
12 3 5
0 3
1 1
2 2
3 3
4 2 11
5 2
6 6
7 2
8 2
9 1 13
10 1
11 0
12 1 2
...
259 50 89
260 1
261 4
262 5 10
I tried iterrows and groupby but can't make it work so far.
Use this:
df['new col'] = df.groupby(df.index // 5)['Data'].transform('sum')[lambda x: ~(x.duplicated(keep='last'))]
Output:
Data new col
0 5 NaN
1 1 NaN
2 3 NaN
3 3 NaN
4 2 14.0
5 4 NaN
6 8 NaN
7 1 NaN
8 2 NaN
9 1 16.0
Edit to handle updated question:
g = df.groupby(df.Row).cumcount()
df['new col'] = df.groupby([g, df.Row // 5])['Data']\
.transform('sum')[lambda x: ~(x.duplicated(keep='last'))]
Output:
Row Data new col
0 0 5 NaN
1 1 1 NaN
2 2 3 NaN
3 3 3 NaN
4 4 2 14.0
5 5 4 NaN
6 6 8 NaN
7 7 1 NaN
8 8 2 NaN
9 9 1 16.0
10 10 0 NaN
11 11 2 NaN
12 12 3 5.0
13 0 3 NaN
14 1 1 NaN
15 2 2 NaN
16 3 3 NaN
17 4 2 11.0
18 5 2 NaN
19 6 6 NaN
20 7 2 NaN
21 8 2 NaN
22 9 1 13.0
23 10 1 NaN
24 11 0 NaN
25 12 1 2.0
I have a dataframe that looks like this:
userId movieId rating
0 1 31 2.5
1 1 1029 3.0
2 1 3671 3.0
3 2 10 4.0
4 2 17 5.0
5 3 60 3.0
6 3 110 4.0
7 3 247 3.5
8 4 10 4.0
9 4 112 5.0
10 5 3 4.0
11 5 39 4.0
12 5 104 4.0
I need to get a dataframe which has unique userId, number of ratings by the user and the average rating by the user as shown below:
userId count mean
0 1 3 2.83
1 2 2 4.5
2 3 3 3.5
3 4 2 4.5
4 5 3 4.0
Can someone help?
df1 = df.groupby('userId')['rating'].agg(['count','mean']).reset_index()
print(df1)
userId count mean
0 1 3 2.833333
1 2 2 4.500000
2 3 3 3.500000
3 4 2 4.500000
4 5 3 4.000000
Drop movieId since we're not using it, groupby userId, and then apply the aggregation methods:
import pandas as pd
df = pd.DataFrame({'userId': [1,1,1,2,2,3,3,3,4,4,5,5,5],
'movieId':[31,1029,3671,10,17,60,110,247,10,112,3,39,104],
'rating':[2.5,3.0,3.0,4.0,5.0,3.0,4.0,3.5,4.0,5.0,4.0,4.0,4.0]})
df = df.drop('movieId', axis=1).groupby('userId').agg(['count','mean'])
print(df)
Which produces:
rating
count mean
userId
1 3 2.833333
2 2 4.500000
3 3 3.500000
4 2 4.500000
5 3 4.000000
Here's a NumPy based approach using the fact that userID column appears to be sorted -
unq, tags, count = np.unique(df.userId.values, return_inverse=1, return_counts=1)
mean_vals = np.bincount(tags, df.rating.values)/count
df_out = pd.DataFrame(np.c_[unq, count], columns = (('userID', 'count')))
df_out['mean'] = mean_vals
Sample run -
In [103]: df
Out[103]:
userId movieId rating
0 1 31 2.5
1 1 1029 3.0
2 1 3671 3.0
3 2 10 4.0
4 2 17 5.0
5 3 60 3.0
6 3 110 4.0
7 3 247 3.5
8 4 10 4.0
9 4 112 5.0
10 5 3 4.0
11 5 39 4.0
12 5 104 4.0
In [104]: df_out
Out[104]:
userID count mean
0 1 3 2.833333
1 2 2 4.500000
2 3 3 3.500000
3 4 2 4.500000
4 5 3 4.000000