I have a dataset as below.
building_id meter meter_reading primary_use square_feet air_temperature dew_temperature sea_level_pressure wind_direction wind_speed hour day weekend month
0 0 0 NaN 0 7432 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
1 1 0 NaN 0 2720 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
2 2 0 NaN 0 5376 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
3 3 0 NaN 0 23685 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
4 4 0 NaN 0 116607 25.0 20.0 1019.7 0.0 0.0 0 1 4 1
You can see that the values under meter_reading are NaN, and I'd like to fill them with the column mean grouped by the "primary_use" and "square_feet" columns. Which API could I use to achieve this? I am currently using scikit-learn's imputer.
Thanks, your help is highly appreciated.
If you use a pandas DataFrame, it already brings everything you need.
Note that primary_use is a categorical feature while square_feet is continuous. So first you would want to bin square_feet into categories, so you can calculate the mean meter_reading per group.
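A minimal sketch of that idea in plain pandas (the values and the choice of 3 bins are hypothetical, standing in for the real data): bin square_feet with pd.cut, then fill each NaN with its group mean via groupby + transform:

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for the question's data.
df = pd.DataFrame({
    "primary_use": [0, 0, 1, 1, 1],
    "square_feet": [7432, 2720, 5376, 23685, 116607],
    "meter_reading": [1.0, np.nan, 3.0, np.nan, 5.0],
})

# Bin the continuous square_feet column, then fill each NaN with the
# mean meter_reading of its (primary_use, size-bin) group.
df["sqft_bin"] = pd.cut(df["square_feet"], bins=3)
df["meter_reading"] = (
    df.groupby(["primary_use", "sqft_bin"], observed=True)["meter_reading"]
      .transform(lambda s: s.fillna(s.mean()))
)
```

The transform keeps the original index, so the filled column drops straight back into the frame.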
I have a dataframe of time series data, in which data reporting starts at different times (columns) for different observation units (rows). Prior to the first reported datapoint for each unit, the dataframe contains NaN values, e.g.
0 1 2 3 4 ...
A NaN NaN 4 5 6 ...
B NaN 7 8 NaN 10...
C NaN 2 11 24 17...
I want to replace the leading (left-side) NaN values with 0, but only the leading ones (i.e. leaving the internal missing ones as NaN). So the result on the example above would be:
0 1 2 3 4 ...
A 0 0 4 5 6 ...
B 0 7 8 NaN 10...
C 0 2 11 24 17...
(Note the retained NaN for row B col 3)
I could iterate through the dataframe row-by-row, identify the first index of a non-NaN value in each row, and replace everything left of that with 0. But is there a way to do this as a whole-array operation?
notna + cumsum along rows; cells where the cumulative count of non-NaN values is still zero are the leading NaNs:
df[df.notna().cumsum(1) == 0] = 0
df
0 1 2 3 4
A 0.0 0.0 4 5.0 6
B 0.0 7.0 8 NaN 10
C 0.0 2.0 11 24.0 17
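As a self-contained sketch (reconstructing the example frame from the question):

```python
import numpy as np
import pandas as pd

# Reconstruction of the example frame from the question.
df = pd.DataFrame([[np.nan, np.nan, 4, 5, 6],
                   [np.nan, 7, 8, np.nan, 10],
                   [np.nan, 2, 11, 24, 17]],
                  index=list('ABC'))

# A cell is a *leading* NaN exactly when no non-NaN value has appeared
# to its left, i.e. the row-wise cumulative count of notna() is still 0.
df[df.notna().cumsum(axis=1) == 0] = 0
```

The internal NaN at row B, column 3 survives because a non-NaN value (7) already appeared to its left.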
Here is another way, using cumprod() and apply(). First count the leading NaNs per row, then fill at most that many from the left:
s = df.isna().cumprod(axis=1).sum(axis=1)
df.apply(lambda x: x.fillna(0, limit=s.loc[x.name]), axis=1)
(Note that fillna requires limit > 0, so this assumes every row has at least one leading NaN, as in the example.)
Output:
0 1 2 3 4
A 0.0 0.0 4.0 5.0 6.0
B 0.0 7.0 8.0 NaN 10.0
C 0.0 2.0 11.0 24.0 17.0
I want to calculate a rolling mean only where a Marker column is 1. This is a small example, but the real-world data is massive and needs to be handled efficiently.
df = pd.DataFrame()
df['Obs'] = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]
df['Marker'] = [0,0,0,0,1,0,0,0,0,1,0,0,0,0,1]
df['Mean'] = df.Obs.rolling(5).mean()
How can I create a Desired column like this:
df['Desired']=[0,0,0,0,3.0,0,0,0,0,8.0,0,0,0,0,13.0]
print(df)
Obs Marker Mean Desired
0 1 0 NaN 0.0
1 2 0 NaN 0.0
2 3 0 NaN 0.0
3 4 0 NaN 0.0
4 5 1 3.0 3.0
5 6 0 4.0 0.0
6 7 0 5.0 0.0
7 8 0 6.0 0.0
8 9 0 7.0 0.0
9 10 1 8.0 8.0
10 11 0 9.0 0.0
11 12 0 10.0 0.0
12 13 0 11.0 0.0
13 14 0 12.0 0.0
14 15 1 13.0 13.0
You are close, just need a where:
df['Mean']= df.Obs.rolling(5).mean().where(df['Marker']==1, 0)
Output:
Obs Marker Mean
0 1 0 0.0
1 2 0 0.0
2 3 0 0.0
3 4 0 0.0
4 5 1 3.0
5 6 0 0.0
6 7 0 0.0
7 8 0 0.0
8 9 0 0.0
9 10 1 8.0
10 11 0 0.0
11 12 0 0.0
12 13 0 0.0
13 14 0 0.0
14 15 1 13.0
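To reproduce the asker's Desired column exactly, the same expression can be assigned to a new column instead of overwriting Mean; a self-contained sketch with the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    'Obs': list(range(1, 16)),
    'Marker': [0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],
})

# Rolling 5-observation mean, kept only where Marker == 1, else 0.
# where() also replaces the leading NaNs (windows shorter than 5) with 0.
df['Desired'] = df['Obs'].rolling(5).mean().where(df['Marker'].eq(1), 0)
```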
Let's say I have a dataframe like this:
>> Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44 1 96 1 40 1 88 0 81
1 2017-05-01 State NY 0 42 0 55 1 92 1 82 0 38
2 2017-06-01 State NY 1 11 0 7 1 35 0 70 1 61
3 2017-07-01 State NY 1 12 1 80 1 83 1 47 1 44
4 2017-08-01 State NY 1 63 1 48 0 61 0 5 0 20
5 2017-09-01 State NY 1 56 1 92 0 55 0 45 1 17
I'd like to replace the values of the columns ending in _rank with NaN wherever the corresponding _flag is zero, to get something like this:
>> Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44.0 1 96.0 1 40.0 1 88.0 0 NaN
1 2017-05-01 State NY 0 NaN 0 NaN 1 92.0 1 82.0 0 NaN
2 2017-06-01 State NY 1 11.0 0 NaN 1 35.0 0 NaN 1 61.0
3 2017-07-01 State NY 1 12.0 1 80.0 1 83.0 1 47.0 1 44.0
4 2017-08-01 State NY 1 63.0 1 48.0 0 NaN 0 NaN 0 NaN
5 2017-09-01 State NY 1 56.0 1 92.0 0 NaN 0 NaN 1 17.0
Which is fairly simple. This is my approach:
for k in variables:
    dt[k + '_rank'] = np.where(dt[k + '_flag'] == 0, np.nan, dt[k + '_rank'])
Although this works fine for a smaller dataset, it takes a significant amount of time to process a dataframe with a very high number of columns and entries. So is there an optimized way of achieving the same without iteration?
P.S. There are other payloads apart from _rank and _flag in the data.
Thanks in advance
Use .str.endswith to select the columns that end with _flag. Then use rstrip to strip the flag label and append rank, which yields the corresponding _rank column names. Finally, use np.where to set the _rank columns to NaN wherever the corresponding _flag value is 0:
flags = df.columns[df.columns.str.endswith('_flag')]
ranks = flags.str.rstrip('flag') + 'rank'
df[ranks] = np.where(df[flags].eq(0), np.nan, df[ranks])
OR, it is also possible to use DataFrame.mask:
df[ranks] = df[ranks].mask(df[flags].eq(0).to_numpy())
Result:
# print(df)
Time level value a_flag a_rank b_flag b_rank c_flag c_rank d_flag d_rank e_flag e_rank
0 2017-04-01 State NY 1 44.0 1 96.0 1 40.0 1 88.0 0 NaN
1 2017-05-01 State NY 0 NaN 0 NaN 1 92.0 1 82.0 0 NaN
2 2017-06-01 State NY 1 11.0 0 NaN 1 35.0 0 NaN 1 61.0
3 2017-07-01 State NY 1 12.0 1 80.0 1 83.0 1 47.0 1 44.0
4 2017-08-01 State NY 1 63.0 1 48.0 0 NaN 0 NaN 0 NaN
5 2017-09-01 State NY 1 56.0 1 92.0 0 NaN 0 NaN 1 17.0
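A self-contained sketch of the same approach on a hypothetical two-letter subset (the data here is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical subset using the same _flag/_rank naming convention.
df = pd.DataFrame({
    'a_flag': [1, 0, 1],
    'a_rank': [44, 42, 11],
    'b_flag': [1, 0, 0],
    'b_rank': [96, 55, 7],
})

flags = df.columns[df.columns.str.endswith('_flag')]
# rstrip strips a *character set*, not a suffix; it works here because
# the trailing '_' stops the stripping: 'a_flag' -> 'a_' + 'rank'.
ranks = flags.str.rstrip('flag') + 'rank'

# Blank out every rank whose matching flag is 0, in one vectorized step.
df[ranks] = np.where(df[flags].eq(0), np.nan, df[ranks])
```

The flag and rank column lists line up positionally, which is what lets np.where compare them as two same-shaped arrays.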
I need some help performing a few operations over subgroups, but I am getting really confused. I will quickly describe the operations and the desired output in the comments below.
(1) Calculate the % frequency of appearance per subgroup
(2) Appear a record that does not exist with 0
(3) Rearrange order of records and columns
Assume the df below as the raw data:
df=pd.DataFrame({'store':[1,1,1,2,2,2,3,3,3,3],
'branch':['A','A','C','C','C','C','A','A','C','A'],
'products':['clothes', 'shoes', 'clothes', 'shoes', 'accessories', 'clothes', 'bags', 'bags', 'clothes', 'clothes']})
The grouped_df below is close to what I have in mind but I can't get the desired output.
grouped_df=df.groupby(['store', 'branch', 'products']).size().unstack('products').replace({np.nan:0})
# output:
products accessories bags clothes shoes
store branch
1 A 0.0 0.0 1.0 1.0
C 0.0 0.0 1.0 0.0
2 C 1.0 0.0 1.0 1.0
3 A 0.0 2.0 1.0 0.0
C 0.0 0.0 1.0 0.0
# desirable output: if (1), (2) and (3) take place somehow...
products clothes shoes accessories bags
store branch
1 B 0 0 0 0 #group 1 has 2 clothes and 1 shoes across A and C, 3 records in total, so each count becomes 33.3%
A 33.3 33.3 0 0
C 33.3 0.0 0 0
2 B 0 0 0 0
A 0 0 0 0
C 33.3 33.3 33.3 0
3 B 0 0 0 0 #group 3 has 2 bags and 2 clothes across A and C, 4 in total, which turns the 2 bags into 50% and each clothes count into 25%
A 25 0 0 50
C 25 0 0 0
# (3) rearrangement of columns with "clothes" and "shoes" going first
# (3)+(2) branch B appeared and the order of branches changed to B, A, C
# (1) percentage calculations of the occurrences have been performed over groups that hopefully have made sense with the comments above
I have tried to handle each group separately, but i) it does not take into consideration the replaced NaN values, and ii) I should avoid handling each group separately, because I would need to concatenate a lot of groups afterwards (this df is just an example), as I will need to plot the whole group later on.
grouped_df.loc[[1]].transform(lambda x: x*100/sum(x)).round(0)
>>>
products accessories bags clothes shoes
store branch
1 A NaN NaN 50.0 100.0 #why has it transformed on axis='columns'?
C NaN NaN 50.0 0.0
Hopefully my question makes sense. Any insight into what I try to perform is very appreciated in advance, thank you a lot!
With the help of @Quang Hoang, who tried to help with this question the day before I posted my answer, I managed to find a solution.
To explain the last bit of the calculation: I transformed every element by dividing it by the total count for its store, so the frequency is computed per 0th-level group rather than row-, column-, or total-wise.
grouped_df = df.groupby(['store', 'branch', 'products']).size()\
.unstack('branch')\
.reindex(['B','C','A'], axis=1, fill_value=0)\
.stack('branch')\
.unstack('products')\
.replace({np.nan:0})\
.transform(
lambda x: x*100/df.groupby(['store']).size()
).round(1)\
.reindex(['clothes', 'shoes', 'accessories', 'bags'], axis='columns')
Running the piece of code above, produces the desired output:
products clothes shoes accessories bags
store branch
1 B 0.0 0.0 0.0 0.0
C 33.3 0.0 0.0 0.0
A 33.3 33.3 0.0 0.0
2 B 0.0 0.0 0.0 0.0
C 33.3 33.3 33.3 0.0
3 B 0.0 0.0 0.0 0.0
C 25.0 0.0 0.0 0.0
A 25.0 0.0 0.0 50.0
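The per-store percentage (steps (1) and (3)) can also be sketched with a level-aware division; this is an alternative sketch, not the chain above, and it leaves inserting the missing branch B to a reindex as in that chain:

```python
import pandas as pd

# Same sample frame as in the question.
df = pd.DataFrame({
    'store': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'branch': ['A', 'A', 'C', 'C', 'C', 'C', 'A', 'A', 'C', 'A'],
    'products': ['clothes', 'shoes', 'clothes', 'shoes', 'accessories',
                 'clothes', 'bags', 'bags', 'clothes', 'clothes'],
})

counts = (df.groupby(['store', 'branch', 'products']).size()
            .unstack('products', fill_value=0))

# Divide each (store, branch) row by its store's total record count,
# broadcasting the store-level sizes across the MultiIndex.
pct = (counts.div(df.groupby('store').size(), axis=0, level='store')
             .mul(100).round(1)
             .reindex(['clothes', 'shoes', 'accessories', 'bags'], axis=1))
```

The level='store' argument is what aligns the store-indexed size Series against the (store, branch) MultiIndex without an explicit loop.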
I have a dataset of 6.5 million rows in which each user ID's interaction with a token supplier are recorded.
The data is sorted by 'id' and 'Days'
The 'Days' column is the count of days since they joined the supplier.
The day on which a user is given tokens, it is mentioned in column token_SUPPLY.
Each day one token is used.
I want to create a column in which the number of available tokens for every row is mentioned.
The logic I've used is:
For each row check if we are still looking at the same user 'id'. If yes then check if any tokens have been supplied, if yes, then save the day number.
For each subsequent row of the same user, calculate the available tokens as the number of tokens supplied minus the number of days passed since the tokens were supplied.
currID = 0
tokenSupply = 0
giveDay = 0
for row in df11.itertuples():
    if row.id != currID:
        tokenSupply = 0
        currID = row.id
    if row.token_SUPPLY > 0:
        giveDay = row.Days
        tokenSupply = row.token_SUPPLY
        df11.loc[row.Index, "token_onhand"] = tokenSupply
    else:
        if tokenSupply == 0:
            df11.loc[row.Index, "token_onhand"] = 0
        else:
            df11.loc[row.Index, "token_onhand"] = tokenSupply - (row.Days - giveDay)
# The for loop doesn't finish for more than 50 minutes.
I've been reading a lot since last night, and it seems people have suggested using numpy, but I don't know how to do that, as I'm just learning these things. The other suggestion was @jit, but I guess that works only if I define a function.
Another suggestion was to vectorize, but how would I then access rows conditionally and remember the supplied quantity to use in every subsequent row?
I did try using np.where, but it seemed to get too convoluted to wrap my head around.
I also read somewhere about Cython, but again, I have no idea how to do that properly.
What would be the best approach to achieve my objective ?
EDIT: Added sample data and required output column
Sample output data:
id Days token_SUPPLY give_event token_onhand
190 ID1001 -12 NaN 0 0.0
191 ID1001 -12 NaN 0 0.0
192 ID1001 -3 NaN 0 0.0
193 ID1001 0 5.0 0 5.0
194 ID1001 0 5.0 1 5.0
195 ID1001 6 NaN 0 -1.0
196 ID1001 12 NaN 0 -7.0
197 ID1001 12 NaN 0 -7.0
198 ID1001 13 NaN 0 -8.0
199 ID1001 13 NaN 0 -8.0
The last column token_onhand is not in the dataset, and is what actually needs to be generated.
If I understand correctly:
Sample Data:
id Days token_SUPPLY give_event
0 ID1001 -12 NaN 0
1 ID1001 -12 NaN 0
2 ID1001 -3 NaN 0
3 ID1001 0 5.0 0
4 ID1001 0 5.0 1
5 ID1001 6 NaN 0
6 ID1001 12 NaN 0
7 ID1001 12 NaN 0
8 ID1001 13 NaN 0
9 ID1001 13 NaN 0
10 ID1002 -12 NaN 0
11 ID1002 -12 NaN 0
12 ID1002 -3 NaN 0
13 ID1002 0 5.0 0
14 ID1002 0 5.0 1
15 ID1002 6 NaN 0
16 ID1002 12 NaN 0
17 ID1002 12 NaN 0
18 ID1002 13 NaN 0
19 ID1002 13 NaN 0
You can use ffill on token_SUPPLY and subtract Days. For more than one id, use groupby.
df = pd.read_clipboard()
df['token_onhand'] = df.groupby('id').apply(lambda x: (x['token_SUPPLY'].ffill() - x['Days']).fillna(0)).reset_index(drop=True)
df
Result:
id Days token_SUPPLY give_event token_onhand
0 ID1001 -12 NaN 0 0.0
1 ID1001 -12 NaN 0 0.0
2 ID1001 -3 NaN 0 0.0
3 ID1001 0 5.0 0 5.0
4 ID1001 0 5.0 1 5.0
5 ID1001 6 NaN 0 -1.0
6 ID1001 12 NaN 0 -7.0
7 ID1001 12 NaN 0 -7.0
8 ID1001 13 NaN 0 -8.0
9 ID1001 13 NaN 0 -8.0
10 ID1002 -12 NaN 0 0.0
11 ID1002 -12 NaN 0 0.0
12 ID1002 -3 NaN 0 0.0
13 ID1002 0 5.0 0 5.0
14 ID1002 0 5.0 1 5.0
15 ID1002 6 NaN 0 -1.0
16 ID1002 12 NaN 0 -7.0
17 ID1002 12 NaN 0 -7.0
18 ID1002 13 NaN 0 -8.0
19 ID1002 13 NaN 0 -8.0
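A fully vectorized variant of the same ffill idea, which also handles supplies given on days other than 0 by forward-filling the give day as well (the data below is hypothetical, following the question's pattern):

```python
import numpy as np
import pandas as pd

# Two hypothetical users following the question's pattern.
df = pd.DataFrame({
    'id': ['ID1001'] * 5 + ['ID1002'] * 5,
    'Days': [-12, -3, 0, 6, 12] * 2,
    'token_SUPPLY': [np.nan, np.nan, 5.0, np.nan, np.nan] * 2,
})

# Carry the last supply, and the day it was given, forward within each id,
# then compute supply - (days elapsed since the give day).
supply = df.groupby('id')['token_SUPPLY'].ffill()
give_day = df['Days'].where(df['token_SUPPLY'].notna())
give_day = give_day.groupby(df['id']).ffill()

# Rows before any supply are still NaN after the subtraction -> 0 on hand.
df['token_onhand'] = (supply - (df['Days'] - give_day)).fillna(0)
```

Because everything is a column-wise operation, this avoids the row-by-row itertuples loop entirely, which is what makes it feasible on millions of rows.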