Change values in a column to np.nan based upon row index - python-3.x

I want to selectively change column values to np.nan.
I have a column with a lot of zero (0) values.
I sample a subset of those zero rows and place their row indices into a variable (s0).
I then use s0 to set the column value to np.nan for just those rows.
It runs, but it is changing every single row (i.e., the entire column) to np.nan.
Here is my code:
print((df3['amount_tsh'] == 0).sum()) # 41639 <-- there are this many zeros to start
# print(df3['amount_tsh'].value_counts()[0])
s0 = df3['amount_tsh'][df3['amount_tsh'].eq(0)].sample(37322).index # grab 37322 row indexes
print(len(s0)) # 37322
df3['amount_tsh'] = df3.loc[df3.index.isin(s0), 'amount_tsh'] = np.nan # change the value in the column to np.nan if its index is in s0
print(df3['amount_tsh'].isnull().sum())

Let's try:
s0 = df3.loc[df3['amount_tsh'].eq(0), ['amount_tsh']].sample(37322)
df3.loc[df3.index.isin(s0.index), 'amount_tsh'] = np.nan
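Note that s0.index already holds labels from df3.index, so the isin mask is optional; passing the labels straight to .loc is equivalent:
df3.loc[s0.index, 'amount_tsh'] = np.nan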
As a quick check, I used this data I had in a notebook and it worked for me:
import pandas as pd
import numpy as np
data = pd.DataFrame({'Symbol': {0: 'ABNB', 1: 'DKNG', 2: 'EXPE', 3: 'MPNGF', 4: 'RDFN', 5: 'ROKU', 6: 'VIACA', 7: 'Z'},
                     'Number of Buys': {0: np.nan, 1: 2.0, 2: np.nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 1.0, 7: np.nan},
                     'Number of Sell s': {0: 1.0, 1: np.nan, 2: 1.0, 3: np.nan, 4: np.nan, 5: np.nan, 6: np.nan, 7: 1.0},
                     'Gains/Losses': {0: 2106.0, 1: -1479.2, 2: 1863.18, 3: -1980.0, 4: -1687.7, 5: -1520.52, 6: -1282.4, 7: 1624.59},
                     'Percentage change': {0: 0.0, 1: 2.0, 2: 0.0, 3: 0.0, 4: 1.5, 5: 0.0, 6: 0.0, 7: 0.0}})
rows = ['ABNB','DKNG','EXPE']
data
Symbol Number of Buys Number of Sell s Gains/Losses \
0 ABNB NaN 1.0 2106.00
1 DKNG 2.0 NaN -1479.20
2 EXPE NaN 1.0 1863.18
3 MPNGF 1.0 NaN -1980.00
4 RDFN 2.0 NaN -1687.70
5 ROKU 1.0 NaN -1520.52
6 VIACA 1.0 NaN -1282.40
7 Z NaN 1.0 1624.59
Percentage change
0 0.0
1 2.0
2 0.0
3 0.0
4 1.5
5 0.0
6 0.0
7 0.0
Using your approach:
(data['Number of Buys'] == 1.0).sum()
s0 = data.loc[data['Number of Buys'] == 1.0, ['Number of Buys']].sample(2)
data.loc[data.index.isin(s0.index), 'Number of Buys'] = np.nan
Symbol Number of Buys Number of Sell s Gains/Losses \
0 ABNB NaN 1.0 2106.00
1 DKNG 2.0 NaN -1479.20
2 EXPE NaN 1.0 1863.18
3 MPNGF 1.0 NaN -1980.00
4 RDFN 2.0 NaN -1687.70
5 ROKU NaN NaN -1520.52
6 VIACA NaN NaN -1282.40
7 Z NaN 1.0 1624.59
Percentage change
0 0.0
1 2.0
2 0.0
3 0.0
4 1.5
5 0.0
6 0.0
7 0.0

Hmm...
I removed the re-assignment and it worked:
s0 = df3['amount_tsh'][df3['amount_tsh'].eq(0)].sample(37322).index
df3.loc[df3.index.isin(s0), 'amount_tsh'] = np.nan
The second line had been:
df3['amount_tsh'] = df3.loc[df3.index.isin(s0), 'amount_tsh'] = np.nan
That chained assignment is the culprit: Python assigns np.nan to both targets, so df3['amount_tsh'] = np.nan rewrote the entire column regardless of the .loc mask.
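A minimal toy illustration of the difference (a sketch, assuming only pandas and numpy):
import numpy as np
import pandas as pd
df = pd.DataFrame({'x': [0.0, 0.0, 5.0]})
picked = df['x'][df['x'].eq(0)].sample(1, random_state=0).index
# chained form: np.nan is assigned to BOTH targets, so df2['x'] = np.nan
# overwrites the whole column no matter what the .loc mask selects
df2 = df.copy()
df2['x'] = df2.loc[df2.index.isin(picked), 'x'] = np.nan
print(df2['x'].isna().sum())  # 3 -- the entire column
# single-target form: only the sampled row becomes NaN
df3 = df.copy()
df3.loc[df3.index.isin(picked), 'x'] = np.nan
print(df3['x'].isna().sum())  # 1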

Related

When I try to handle missing values in pandas, some methods are not working

I am trying to handle some missing values in a dataset. This is the link for the tutorial that I am using to learn. Below is the code that I am using to read the data.
import pandas as pd
import numpy as np
questions = pd.read_csv("./archive/questions.csv")
print(questions.head())
This is what my data looks like
These are the methods that I am using to handle the missing values. None of them are working.
questions.replace(to_replace = np.nan, value = -99)
questions = questions.fillna(method ='pad')
questions.interpolate(method ='linear', limit_direction = 'forward')
Then I tried to drop the rows with missing values. None of these work either; all of them return an empty DataFrame.
questions.dropna()
questions.dropna(how = "all")
questions.dropna(axis = 1)
What am I doing wrong?
Edit:
Values from questions.head()
[[1 '2008-07-31T21:26:37Z' nan '2011-03-28T00:53:47Z' 1 nan 0.0]
 [4 '2008-07-31T21:42:52Z' nan nan 458 8.0 13.0]
 [6 '2008-07-31T22:08:08Z' nan nan 207 9.0 5.0]
 [8 '2008-07-31T23:33:19Z' '2013-06-03T04:00:25Z' '2015-02-11T08:26:40Z' 42 nan 8.0]
 [9 '2008-07-31T23:40:59Z' nan nan 1410 1.0 58.0]]
Values from questions.head() in a dictionary form.
{'Id': {0: 1, 1: 4, 2: 6, 3: 8, 4: 9}, 'CreationDate': {0: '2008-07-31T21:26:37Z', 1: '2008-07-31T21:42:52Z', 2: '2008-07-31T22:08:08Z', 3: '2008-07-31T23:33:19Z', 4: '2008-07-31T23:40:59Z'}, 'ClosedDate': {0: nan, 1: nan, 2: nan, 3: '2013-06-03T04:00:25Z', 4: nan}, 'DeletionDate': {0: '2011-03-28T00:53:47Z', 1: nan, 2: nan, 3: '2015-02-11T08:26:40Z', 4: nan}, 'Score': {0: 1, 1: 458, 2: 207, 3: 42, 4: 1410}, 'OwnerUserId': {0: nan, 1: 8.0, 2: 9.0, 3: nan, 4: 1.0}, 'AnswerCount': {0: 0.0, 1: 13.0, 2: 5.0, 3: 8.0, 4: 58.0}}
Information regarding the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17203824 entries, 0 to 17203823
Data columns (total 7 columns):
# Column Dtype
--- ------ -----
0 Id int64
1 CreationDate object
2 ClosedDate object
3 DeletionDate object
4 Score int64
5 OwnerUserId float64
6 AnswerCount float64
dtypes: float64(2), int64(2), object(3)
memory usage: 918.8+ MB
Can you try specifying the axis explicitly and see if it works? The other fillna() should still work without an axis, but for pad you need it so pandas knows along which axis to fill the missing values.
>>> questions.fillna(method='pad', axis=1)
Id CreationDate ClosedDate DeletionDate Score OwnerUserId AnswerCount
0 1 2008-07-31T21:26:37Z 2008-07-31T21:26:37Z 2011-03-28T00:53:47Z 1 1 0
1 4 2008-07-31T21:42:52Z 2008-07-31T21:42:52Z 2008-07-31T21:42:52Z 458 8 13
2 6 2008-07-31T22:08:08Z 2008-07-31T22:08:08Z 2008-07-31T22:08:08Z 207 9 5
3 8 2008-07-31T23:33:19Z 2013-06-03T04:00:25Z 2015-02-11T08:26:40Z 42 42 8
4 9 2008-07-31T23:40:59Z 2008-07-31T23:40:59Z 2008-07-31T23:40:59Z 1410 1 58
Plain fillna() applied to the entire DataFrame works as expected:
>>> questions.fillna('-')
Id CreationDate ClosedDate DeletionDate Score OwnerUserId AnswerCount
0 1 2008-07-31T21:26:37Z - 2011-03-28T00:53:47Z 1 - 0.0
1 4 2008-07-31T21:42:52Z - - 458 8 13.0
2 6 2008-07-31T22:08:08Z - - 207 9 5.0
3 8 2008-07-31T23:33:19Z 2013-06-03T04:00:25Z 2015-02-11T08:26:40Z 42 - 8.0
4 9 2008-07-31T23:40:59Z - - 1410 1 58.0
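A further point that may explain "none of them are working": replace and interpolate, as called in the question, return new DataFrames whose results are discarded; only the fillna line assigns back. A minimal sketch of the difference, using toy data rather than the real dataset:
import numpy as np
import pandas as pd
questions = pd.DataFrame({'Score': [1, 458], 'OwnerUserId': [np.nan, 8.0]})
questions.replace(to_replace=np.nan, value=-99)              # result discarded
print(questions['OwnerUserId'].isna().sum())                 # still 1
questions = questions.replace(to_replace=np.nan, value=-99)  # result assigned back
print(questions['OwnerUserId'].isna().sum())                 # now 0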

Replace NULL or NA in a column wrt to other column in pandas data frame [duplicate]

This question already has answers here:
Pandas conditional creation of a series/dataframe column
(13 answers)
Closed 3 years ago.
I have a table:
df = pd.DataFrame([[0.1, 2, 55, 0, np.nan],
                   [0.2, np.nan, np.nan, 1, 99],
                   [0.6, np.nan, 22, 5, 88],
                   [1.4, np.nan, np.nan, 4, 77]],
                  columns=list('ABCDE'))
A B C D E
0 0.1 2.0 55.0 0 NaN
1 0.2 NaN NaN 1 99.0
2 0.6 NaN 22.0 5 88.0
3 1.4 NaN NaN 4 77.0
I want to replace NaN values in column B based on a condition on column A.
Example:
When B is NaN and the value in `column A > 0.2 and < 0.6`, replace the NaN in column B with 5.
When B is NaN and the value in `column A > 0.6 and < 2`, replace the NaN in column B with 10.
I tried something like this:
if df["A"]>=val1 and pd.isnull(df['B']):
df["B"]=5
elif df["A"]>=val2 and df["A"]<val3 and pd.isnull(df['B']):
df["B"]=10
elif df["A"]<val4 and pd.isnull(df['B']):
df["B"]=15
The above code does not work.
Please let me know if there is an alternative approach, using a for loop or apply functions, to iterate over the pandas DataFrame.
You can use mask:
df['B'] = df['B'].mask((df['A']>0.2) & (df['A']<0.6), df['B'].fillna(5))
df['B'] = df['B'].mask((df['A']>0.6) & (df['A']<2), df['B'].fillna(10))
Or you can try np.where, though I guess it would involve a long condition.
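For reference, a sketch of that route using np.select instead (same strict inequalities as the mask version; df as defined in the question):
import numpy as np
conditions = [
    df['B'].isna() & df['A'].gt(0.2) & df['A'].lt(0.6),
    df['B'].isna() & df['A'].gt(0.6) & df['A'].lt(2),
]
choices = [5, 10]
df['B'] = np.select(conditions, choices, default=df['B'])  # keep existing B elsewhere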

Pandas, how to dropna values using subset with multiindex dataframe?

I have a data frame with multi-index columns.
From this data frame I need to remove the rows with NaN values in a subset of columns.
I am trying to use the subset option of dropna, but I cannot find the way to specify the subset of columns. I have tried using pd.IndexSlice, but this does not work.
In the example below I need to get rid of the last row.
import pandas as pd
# ---
a = [1, 1, 2, 2, 3, 3]
b = ["a", "b", "a", "b", "a", "b"]
col = pd.MultiIndex.from_arrays([a[:], b[:]])
val = [
[1, 2, 3, 4, 5, 6],
[None, None, 1, 2, 3, 4],
[None, 1, 2, 3, 4, 5],
[None, None, 5, 3, 3, 2],
[None, None, None, None, 5, 7],
]
# ---
df = pd.DataFrame(val, columns=col)
# ---
print(df)
# ---
idx = pd.IndexSlice
df.dropna(axis=0, how="all", subset=idx[1:2, :])
# ---
print(df)
Using the thresh option is an alternative, but if possible I would like to use subset with how='all'.
When dealing with a MultiIndex, each column of the MultiIndex can be specified as a tuple:
In [67]: df.dropna(axis=0, how="all", subset=[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])
Out[67]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
Or, to select all columns whose first level equals 1 or 2 you could use:
In [69]: df.dropna(axis=0, how="all", subset=df.loc[[], [1,2]].columns)
Out[69]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
df[[1,2]].columns also works, but this returns a (possibly large) intermediate DataFrame. df.loc[[], [1,2]].columns is more memory-efficient since its intermediate DataFrame is empty.
If you want to apply the dropna to the columns which have 1 or 2 in level 1, you can do it as follows:
cols = [(c0, c1) for (c0, c1) in df.columns if c0 in [1, 2]]
df.dropna(axis=0, how="all", subset=cols)
If applied to your data, it results in:
Out[446]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
As you can see, the last row (index=4) is gone, because all columns under 1 and 2 were NaN in that row. If you instead want to remove every row where any NaN occurred in those columns, you need:
df.dropna(axis=0, how="any", subset=cols)
Which results in:
Out[447]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
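Another way to build the same subset, sketched here, is to filter the column index directly on its first level; this avoids constructing any intermediate DataFrame at all:
cols = df.columns[df.columns.get_level_values(0).isin([1, 2])]
df.dropna(axis=0, how="all", subset=cols)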

Filling in nans for numbers in a column-specific way

Given a DataFrame and a list of index labels, is there an efficient pandas function that puts NaN in all values vertically preceding each of the entries of the list?
For example, suppose we have the list [4,8] and the following DataFrame:
index 0 1
5 1 2
2 9 3
4 3.2 3
8 9 8.7
The desired output is simply:
index 0 1
5 nan nan
2 nan nan
4 3.2 nan
8 9 8.7
Any suggestions for such a function that does this fast?
Here's one NumPy approach based on np.searchsorted -
s = [4, 8]
a = df.values                    # underlying array; writing into it updates df here (see sample run)
idx = df.index.values
sidx = np.argsort(idx)           # sorter, since the index is unsorted
matching_row_indx = sidx[np.searchsorted(idx, s, sorter=sidx)]   # row position of each label in s
mask = np.arange(a.shape[0])[:, None] < matching_row_indx        # per column: rows above its target
a[mask] = np.nan
Sample run -
In [107]: df
Out[107]:
0 1
index
5 1.0 2.0
2 9.0 3.0
4 3.2 3.0
8 9.0 8.7
In [108]: s = [4,8]
In [109]: a = df.values
...: idx = df.index.values
...: sidx = np.argsort(idx)
...: matching_row_indx = sidx[np.searchsorted(idx, s, sorter = sidx)]
...: mask = np.arange(a.shape[0])[:,None] < matching_row_indx
...: a[mask] = np.nan
...:
In [110]: df
Out[110]:
0 1
index
5 NaN NaN
2 NaN NaN
4 3.2 NaN
8 9.0 8.7
It was a bit tricky to recreate your example but this should do it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'index': [5, 2, 4, 8], 0: [1, 9, 3.2, 9], 1: [2, 3, 3, 8.7]})
df.set_index('index', inplace=True)
for i, item in enumerate([4, 8]):
    for index, _ in df.iterrows():
        if index != item:
            df.at[index, df.columns[i]] = np.nan  # write back via .at; mutating the iterrows row would not persist
        else:
            break
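If the nested loop feels heavy, here is a hedged pandas-native sketch of the same idea: look up the row position of each target label with get_indexer, then blank everything above it in the corresponding column (this assumes one target per column, given in column order):
import numpy as np
import pandas as pd
df = pd.DataFrame({'index': [5, 2, 4, 8], 0: [1, 9, 3.2, 9], 1: [2, 3, 3, 8.7]})
df.set_index('index', inplace=True)
targets = [4, 8]
positions = df.index.get_indexer(targets)   # row position of each target label
for col_pos, row_pos in enumerate(positions):
    df.iloc[:row_pos, col_pos] = np.nan     # NaN out every row above the target
print(df)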

Updating values in a pandas dataframe using another dataframe

I have an existing pandas Dataframe with the following format:
sample_dict = {'ID': [100, 200, 300], 'a': [1, 2, 3], 'b': [.1, .2, .3], 'c': [4, 5, 6], 'd': [.4, .5, .6]}
df_sample = pd.DataFrame(sample_dict)
Now, I want to update df_sample using another dataframe that looks like this:
sample_update = {'ID': [100, 300], 'a': [3, 2], 'b': [.4, .2], 'c': [2, 5], 'd': [.7, .1]}
df_updater = pd.DataFrame(sample_update)
The rules for the update are:
For columns a and c, just add the values from a and c in df_updater.
For column b, it depends on the updated value of a. Let's say the update function is b = old_b + (new_b / updated_a).
For column d, the rules are similar to those for column b, except that it depends on the values of the updated c and the new d.
Here is the desired output:
new = {'ID': [100, 200, 300], 'a': [4, 2, 5], 'b': [.233333, .2, .33999999], 'c': [6, 5, 11], 'd': [.51666666, .5, .609090]}
df_new = pd.DataFrame(new)
My actual problem is a slightly more complicated version of this, but I think this example is enough to solve it. Also, in my real DataFrame I have more columns following the same rules, so I would like this method to loop over the columns if possible. Thanks!
You can use merge, add and div:
df = pd.merge(df_sample, df_updater, on='ID', how='left')
df[['a','c']] = df[['a_y','c_y']].add(df[['a_x','c_x']].values, fill_value=0)
df['b'] = df['b_x'].add(df['b_y'].div(df.a_y), fill_value=0)
df['d'] = df['d_x'].add(df['d_y'].div(df.c_y), fill_value=0)
print (df)
ID a_x b_x c_x d_x a_y b_y c_y d_y a c b d
0 100 1 0.1 4 0.4 3.0 0.4 2.0 0.7 4.0 6.0 0.233333 0.75
1 200 2 0.2 5 0.5 NaN NaN NaN NaN 2.0 5.0 0.200000 0.50
2 300 3 0.3 6 0.6 2.0 0.2 5.0 0.1 5.0 11.0 0.400000 0.62
print (df[['a','b','c','d']])
a b c d
0 4.0 0.233333 6.0 0.75
1 2.0 0.200000 5.0 0.50
2 5.0 0.400000 11.0 0.62
Instead of merge, it is also possible to use concat:
df=pd.concat([df_sample.set_index('ID'),df_updater.set_index('ID')], axis=1,keys=('_x','_y'))
df.columns = [''.join((col[1], col[0])) for col in df.columns]
df.reset_index(inplace=True)
print (df)
ID a_x b_x c_x d_x a_y b_y c_y d_y
0 100 1 0.1 4 0.4 3.0 0.4 2.0 0.7
1 200 2 0.2 5 0.5 NaN NaN NaN NaN
2 300 3 0.3 6 0.6 2.0 0.2 5.0 0.1
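Since the question asks about looping over more columns, here is a hedged sketch that generalizes the merge approach; the pairing [('a', 'b'), ('c', 'd')] and dividing by the new (_y) base value mirror the example above and are assumptions to adapt:
import pandas as pd
df = df_sample.merge(df_updater, on='ID', how='left', suffixes=('_x', '_y'))
for base, dep in [('a', 'b'), ('c', 'd')]:  # (summed column, dependent column)
    df[base] = df[f'{base}_x'].add(df[f'{base}_y'], fill_value=0)
    df[dep] = df[f'{dep}_x'].add(df[f'{dep}_y'].div(df[f'{base}_y']), fill_value=0)
df_new = df[['ID', 'a', 'b', 'c', 'd']]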
