When I try to handle missing values in pandas, some methods are not working - python-3.x

I am trying to handle some missing values in a dataset; this is the link to the tutorial I am following. Below is the code that I am using to read the data.
import pandas as pd
import numpy as np
questions = pd.read_csv("./archive/questions.csv")
print(questions.head())
This is what my data looks like
These are the methods that I am using to handle the missing values. None of them are working.
questions.replace(to_replace = np.nan, value = -99)
questions = questions.fillna(method ='pad')
questions.interpolate(method ='linear', limit_direction = 'forward')
Then I tried to drop the rows with the missing values. None of these works either; all of them return an empty DataFrame.
questions.dropna()
questions.dropna(how = "all")
questions.dropna(axis = 1)
What am I doing wrong?
Edit:
Values from questions.head()
[[1 '2008-07-31T21:26:37Z' nan '2011-03-28T00:53:47Z' 1 nan 0.0]
[4 '2008-07-31T21:42:52Z' nan nan 458 8.0 13.0]
[6 '2008-07-31T22:08:08Z' nan nan 207 9.0 5.0]
[8 '2008-07-31T23:33:19Z' '2013-06-03T04:00:25Z' '2015-02-11T08:26:40Z'
42 nan 8.0]
[9 '2008-07-31T23:40:59Z' nan nan 1410 1.0 58.0]]
Values from questions.head() in a dictionary form.
{'Id': {0: 1, 1: 4, 2: 6, 3: 8, 4: 9}, 'CreationDate': {0: '2008-07-31T21:26:37Z', 1: '2008-07-31T21:42:52Z', 2: '2008-07-31T22:08:08Z', 3: '2008-07-31T23:33:19Z', 4: '2008-07-31T23:40:59Z'}, 'ClosedDate': {0: nan, 1: nan, 2: nan, 3: '2013-06-03T04:00:25Z', 4: nan}, 'DeletionDate': {0: '2011-03-28T00:53:47Z', 1: nan, 2: nan, 3: '2015-02-11T08:26:40Z', 4: nan}, 'Score': {0: 1, 1: 458, 2: 207, 3: 42, 4: 1410}, 'OwnerUserId': {0: nan, 1: 8.0, 2: 9.0, 3: nan, 4: 1.0}, 'AnswerCount': {0: 0.0, 1: 13.0, 2: 5.0, 3: 8.0, 4: 58.0}}
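If you want to reproduce this locally, the dictionary above can be loaded back into a DataFrame (nan here is numpy.nan):
import numpy as np
import pandas as pd

# Rebuild the 5-row sample shown above from the head() dictionary
sample = pd.DataFrame({
    'Id': [1, 4, 6, 8, 9],
    'CreationDate': ['2008-07-31T21:26:37Z', '2008-07-31T21:42:52Z',
                     '2008-07-31T22:08:08Z', '2008-07-31T23:33:19Z',
                     '2008-07-31T23:40:59Z'],
    'ClosedDate': [np.nan, np.nan, np.nan, '2013-06-03T04:00:25Z', np.nan],
    'DeletionDate': ['2011-03-28T00:53:47Z', np.nan, np.nan,
                     '2015-02-11T08:26:40Z', np.nan],
    'Score': [1, 458, 207, 42, 1410],
    'OwnerUserId': [np.nan, 8.0, 9.0, np.nan, 1.0],
    'AnswerCount': [0.0, 13.0, 5.0, 8.0, 58.0],
})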
Information regarding the dataset
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17203824 entries, 0 to 17203823
Data columns (total 7 columns):
# Column Dtype
--- ------ -----
0 Id int64
1 CreationDate object
2 ClosedDate object
3 DeletionDate object
4 Score int64
5 OwnerUserId float64
6 AnswerCount float64
dtypes: float64(2), int64(2), object(3)
memory usage: 918.8+ MB

Can you try specifying the axis explicitly and see if it works? A plain fillna() with a value still works without an axis, but for method='pad' the axis tells pandas which direction to fill the missing values in.
>>> questions.fillna(method='pad', axis=1)
Id CreationDate ClosedDate DeletionDate Score OwnerUserId AnswerCount
0 1 2008-07-31T21:26:37Z 2008-07-31T21:26:37Z 2011-03-28T00:53:47Z 1 1 0
1 4 2008-07-31T21:42:52Z 2008-07-31T21:42:52Z 2008-07-31T21:42:52Z 458 8 13
2 6 2008-07-31T22:08:08Z 2008-07-31T22:08:08Z 2008-07-31T22:08:08Z 207 9 5
3 8 2008-07-31T23:33:19Z 2013-06-03T04:00:25Z 2015-02-11T08:26:40Z 42 42 8
4 9 2008-07-31T23:40:59Z 2008-07-31T23:40:59Z 2008-07-31T23:40:59Z 1410 1 58
A plain fillna() applied to the entire DataFrame works as expected:
>>> questions.fillna('-')
Id CreationDate ClosedDate DeletionDate Score OwnerUserId AnswerCount
0 1 2008-07-31T21:26:37Z - 2011-03-28T00:53:47Z 1 - 0.0
1 4 2008-07-31T21:42:52Z - - 458 8 13.0
2 6 2008-07-31T22:08:08Z - - 207 9 5.0
3 8 2008-07-31T23:33:19Z 2013-06-03T04:00:25Z 2015-02-11T08:26:40Z 42 - 8.0
4 9 2008-07-31T23:40:59Z - - 1410 1 58.0
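One more thing worth checking, since replace(), interpolate() and dropna() were reported as "not working": these methods return a new DataFrame rather than modifying questions in place, so the result needs to be assigned back (or inplace=True passed). A minimal sketch:
# replace / interpolate / dropna return new DataFrames by default;
# without assigning the result, the original questions frame is unchanged
questions = questions.replace(to_replace=np.nan, value=-99)

# or, to drop columns that contain any missing value:
# questions = questions.dropna(axis=1)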


Change values in a column to np.nan based upon row index

I want to selectively change column values to np.nan.
I have a column with a lot of zero (0) values.
I am getting the row indices of a subset of the total.
I place the indices into a variable (s0).
I then use this to set the column value to np.nan for just the rows whose index is in s0.
It runs, but it is changing every single row (i.e., the entire column) to np.nan.
Here is my code:
print((df3['amount_tsh'] == 0).sum()) # 41639 <-- there are this many zeros to start
# print(df3['amount_tsh'].value_counts()[0])
s0 = df3['amount_tsh'][df3['amount_tsh'].eq(0)].sample(37322).index # grab 37322 row indexes
print(len(s0)) # 37322
df3['amount_tsh'] = df3.loc[df3.index.isin(s0), 'amount_tsh'] = np.nan # change the value in the column to np.nan if its index is in s0
print(df3['amount_tsh'].isnull().sum())
Let's try:
s0 = df3.loc[df3['amount_tsh'].eq(0), ['amount_tsh']].sample(37322)
df3.loc[df3.index.isin(s0.index), 'amount_tsh'] = np.nan
For a quick check of the fix I used this data I had in a notebook, and it worked for me:
import pandas as pd
import numpy as np
data = pd.DataFrame({'Symbol': {0: 'ABNB', 1: 'DKNG', 2: 'EXPE', 3: 'MPNGF', 4: 'RDFN', 5: 'ROKU', 6: 'VIACA', 7: 'Z'},
'Number of Buys': {0: np.nan, 1: 2.0, 2: np.nan, 3: 1.0, 4: 2.0, 5: 1.0, 6: 1.0, 7: np.nan},
'Number of Sell s': {0: 1.0, 1: np.nan, 2: 1.0, 3: np.nan, 4: np.nan, 5: np.nan, 6: np.nan, 7: 1.0},
'Gains/Losses': {0: 2106.0, 1: -1479.2, 2: 1863.18, 3: -1980.0, 4: -1687.7, 5: -1520.52, 6: -1282.4, 7: 1624.59}, 'Percentage change': {0: 0.0, 1: 2.0, 2: 0.0, 3: 0.0, 4: 1.5, 5: 0.0, 6: 0.0, 7: 0.0}})
rows = ['ABNB','DKNG','EXPE']
data
Symbol Number of Buys Number of Sell s Gains/Losses \
0 ABNB NaN 1.0 2106.00
1 DKNG 2.0 NaN -1479.20
2 EXPE NaN 1.0 1863.18
3 MPNGF 1.0 NaN -1980.00
4 RDFN 2.0 NaN -1687.70
5 ROKU 1.0 NaN -1520.52
6 VIACA 1.0 NaN -1282.40
7 Z NaN 1.0 1624.59
Percentage change
0 0.0
1 2.0
2 0.0
3 0.0
4 1.5
5 0.0
6 0.0
7 0.0
Using your approach:
(data['Number of Buys'] == 1.0).sum()
s0 = data.loc[(data['Number of Buys'] == 1.0), ['Number of Buys']].sample(2)
data.loc[data.index.isin(s0.index), 'Number of Buys'] = np.nan
Symbol Number of Buys Number of Sell s Gains/Losses \
0 ABNB NaN 1.0 2106.00
1 DKNG 2.0 NaN -1479.20
2 EXPE NaN 1.0 1863.18
3 MPNGF 1.0 NaN -1980.00
4 RDFN 2.0 NaN -1687.70
5 ROKU NaN NaN -1520.52
6 VIACA NaN NaN -1282.40
7 Z NaN 1.0 1624.59
Percentage change
0 0.0
1 2.0
2 0.0
3 0.0
4 1.5
5 0.0
6 0.0
7 0.0
Hmm...
I removed the re-assignment and it worked??
s0 = df3['amount_tsh'][df3['amount_tsh'].eq(0)].sample(37322).index
df3.loc[df3.index.isin(s0), 'amount_tsh'] = np.nan
The second line was:
df3['amount_tsh'] = df3.loc[df3.index.isin(s0), 'amount_tsh'] = np.nan
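That is exactly Python's chained-assignment behaviour: in a = b = value, the targets are assigned left to right, so df3['amount_tsh'] = ... = np.nan first overwrites the whole column with NaN before the .loc assignment even runs. A small sketch on a toy frame (not the original data):
import numpy as np
import pandas as pd

toy = pd.DataFrame({'amount_tsh': [0, 5, 0, 7]})
idx = [0, 2]  # stand-in for the .sample(...).index result

# In a chained assignment the targets are bound left to right, so the whole
# column is overwritten with NaN first, then the .loc subset is set again.
toy['amount_tsh'] = toy.loc[toy.index.isin(idx), 'amount_tsh'] = np.nan
print(toy['amount_tsh'].isnull().sum())  # 4 -- every row ends up NaN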

Convert string "nan" to numpy NaN in pandas Dataframe that contains Strings [duplicate]

I am working on a large dataset with many columns of different types. There is a mix of numeric values and strings, with some NULL values. I need to change the NULL values to blank or 0 depending on the column type.
1  John  2     Doe   3  Mike  4     Orange  5   Stuff
9  NULL  NULL  NULL  8  NULL  NULL  Lemon   12  NULL
I want it to look like this:
1  John  2  Doe  3  Mike  4  Orange  5   Stuff
9        0       8        0  Lemon   12
I can do this for each column individually, but since I am going to be pulling several extremely large datasets with hundreds of columns, I'd like a more general way to do this.
Edit:
Types from a smaller dataset:
Field1 object
Field2 object
Field3 object
Field4 object
Field5 object
Field6 object
Field7 object
Field8 object
Field9 object
Field10 float64
Field11 float64
Field12 float64
Field13 float64
Field14 float64
Field15 object
Field16 float64
Field17 object
Field18 object
Field19 float64
Field20 float64
Field21 int64
Use DataFrame.select_dtypes to get the numeric columns, fill their missing values with 0, then replace the missing values in all other columns with an empty string:
print (df)
0 1 2 3 4 5 6 7 8 9
0 1 John 2.0 Doe 3 Mike 4.0 Orange 5 Stuff
1 9 NaN NaN NaN 8 NaN NaN Lemon 12 NaN
print (df.dtypes)
0 int64
1 object
2 float64
3 object
4 int64
5 object
6 float64
7 object
8 int64
9 object
dtype: object
c = df.select_dtypes(np.number).columns
df[c] = df[c].fillna(0)
df = df.fillna("")
print (df)
0 1 2 3 4 5 6 7 8 9
0 1 John 2.0 Doe 3 Mike 4.0 Orange 5 Stuff
1 9 0.0 8 0.0 Lemon 12
Another solution is to create a dictionary for the replacement:
num_cols = df.select_dtypes(np.number).columns
d1 = dict.fromkeys(num_cols, 0)
d2 = dict.fromkeys(df.columns.difference(num_cols), "")
d = {**d1, **d2}
print (d)
{0: 0, 2: 0, 4: 0, 6: 0, 8: 0, 1: '', 3: '', 5: '', 7: '', 9: ''}
df = df.fillna(d)
print (df)
0 1 2 3 4 5 6 7 8 9
0 1 John 2.0 Doe 3 Mike 4.0 Orange 5 Stuff
1 9 0.0 8 0.0 Lemon 12
You could try this to substitute a different value for each different column (A to C are numeric, while D is a string):
import pandas as pd
import numpy as np
df_pd = pd.DataFrame([[np.nan, 2, np.nan, '0'],
                      [3, 4, np.nan, '1'],
                      [np.nan, np.nan, np.nan, '5'],
                      [np.nan, 3, np.nan, np.nan]],
                     columns=list('ABCD'))
df_pd.fillna(value={'A': 0.0, 'B': 0.0, 'C': 0.0, 'D': ''})

using duplicates values from one column to remove entire row in pandas dataframe

I have the data in the .csv file uploaded in the following link
Click here for the data
In this file, I have the following columns:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
There will be duplicates in the Team column. Another column is SimStage; it contains values from 0 to N (in this case 0 to 4).
I would like to keep one row for each Team at each SimStage value (i.e. the rest will be removed). When removing, the duplicate rows with the lower value in the Points column should be dropped for each Team and SimStage.
Since it is slightly difficult to explain using words alone, I attached a picture here.
In this picture, the rows highlighted in red boxes would be removed.
I used df.duplicates() but it does not work.
It looks like you want to keep only the row with the highest value in the 'Points' column for each Team and SimStage. You can therefore use the 'first' aggregation function in pandas after sorting.
First, create the dataframe and call it df:
data = {'Team': {0: 'Brazil', 1: 'Brazil', 2: 'Brazil', 3: 'Brazil', 4: 'Brazil', 5: 'Brazil', 6: 'Brazil', 7: 'Brazil', 8: 'Brazil', 9: 'Brazil'},
'Group': {0: 'Group E', 1: 'Group E', 2: 'Group E', 3: 'Group E', 4: 'Group E', 5: 'Group E', 6: 'Group E', 7: 'Group E', 8: 'Group E', 9: 'Group E'},
'Model': {0: 'ELO', 1: 'ELO', 2: 'ELO', 3: 'ELO', 4: 'ELO', 5: 'ELO', 6: 'ELO', 7: 'ELO', 8: 'ELO', 9: 'ELO'},
'SimStage': {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3, 8: 4, 9: 4},
'Points': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 1, 6: 2, 7: 4, 8: 4, 9: 1},
'GpWinner': {0: 0.2, 1: 0.2, 2: 0.2, 3: 0.2, 4: 0.2, 5: 0.0, 6: 0.2, 7: 0.2, 8: 0.2, 9: 0.0},
'GpRunnerup': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0, 5: 0.2, 6: 0.0, 7: 0.0, 8: 0.0, 9: 0.2},
'3rd': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0},
'4th': {0: 0, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0}}
df = pd.DataFrame(data)
# To be able to output the dataframe in your original order
columns_order = ['Team', 'Group', 'Model', 'SimStage', 'Points', 'GpWinner', 'GpRunnerup', '3rd', '4th']
Method 1
# Sort by 'SimStage' ascending and 'Points' descending in one call, so that
# within each Team/SimStage group the highest-Points row comes first
df = df.sort_values(['SimStage', 'Points'], ascending=[True, False])
# Group the columns by the necessary columns
df = df.groupby(['Team', 'SimStage'], as_index=False).agg('first')
# Output the dataframe in the orginal order
df[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
1 Brazil Group E ELO 1 4 0.2 0.0 0 0
2 Brazil Group E ELO 2 4 0.2 0.0 0 0
3 Brazil Group E ELO 3 4 0.2 0.0 0 0
4 Brazil Group E ELO 4 4 0.2 0.0 0 0
Method 2
df.sort_values('Points', ascending=False).drop_duplicates(['Team', 'SimStage'])[columns_order]
Out[]:
Team Group Model SimStage Points GpWinner GpRunnerup 3rd 4th
0 Brazil Group E ELO 0 4 0.2 0.0 0 0
2 Brazil Group E ELO 1 4 0.2 0.0 0 0
4 Brazil Group E ELO 2 4 0.2 0.0 0 0
7 Brazil Group E ELO 3 4 0.2 0.0 0 0
8 Brazil Group E ELO 4 4 0.2 0.0 0 0
I am just creating a mini-dataset based on your dataset here with Team, SimStage and Points.
import pandas as pd
namesDf = pd.DataFrame()
namesDf['Team'] = ['Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil', 'Brazil']
namesDf['SimStage'] = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
namesDf['Points'] = [4, 4, 4, 4, 4, 1, 2, 4, 4, 1]
Now, for each SimStage, you want the highest Points value. So I first group by Team and SimStage and then sort by Points within each group.
namesDf = namesDf.groupby(['Team', 'SimStage'], as_index = False).apply(lambda x: x.sort_values(['Points'], ascending = False)).reset_index(drop = True)
This makes my dataframe look like this; notice the change for SimStage 3:
Team SimStage Points
0 Brazil 0 4
1 Brazil 0 4
2 Brazil 1 4
3 Brazil 1 4
4 Brazil 2 4
5 Brazil 2 1
6 Brazil 3 4
7 Brazil 3 2
8 Brazil 4 4
9 Brazil 4 1
And now I remove the duplicates by keeping the first instance of every team and sim stage.
namesDf = namesDf.drop_duplicates(subset=['Team', 'SimStage'], keep = 'first')
Final result:
Team SimStage Points
0 Brazil 0 4
2 Brazil 1 4
4 Brazil 2 4
6 Brazil 3 4
8 Brazil 4 4

Efficient evaluation of weighted average variable in a Pandas Dataframe

Please consider the dataframe df generated below:
import pandas as pd
import numpy as np

def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2, 3, 3],
                'var1': [10, 20, 30, 20, 30, 30],
                'var2': [2, 4, 6, 4, 6, 6],
                'price': [20, 30, 40, 50, 10, 20],
                'sells': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['code', 'var1', 'var2', 'price', 'sells'])
    return df

if __name__ == "__main__":
    df = creatingDataFrame()
    setCode = set(df['code'])
    listDF = []
    for code in setCode:
        dfCode = df[df['code'] == code].copy()
        print(dfCode)
        lenDfCode = len(dfCode)
        if lenDfCode == 1:
            theData = {'code': [dfCode['code'].iloc[0]],
                       'var1': [dfCode['var1'].iloc[0]],
                       'var2': [dfCode['var2'].iloc[0]],
                       'averagePrice': [dfCode['price'].iloc[0]],
                       'totalSells': [dfCode['sells'].iloc[0]]}
        else:
            dfCode['price*sells'] = dfCode['price'] * dfCode['sells']
            sumSells = np.sum(dfCode['sells'])
            sumProducts = np.sum(dfCode['price*sells'])
            dfCode['totalSells'] = sumSells
            av = sumProducts / sumSells
            dfCode['averagePrice'] = av
            theData = {'code': [dfCode['code'].iloc[0]],
                       'var1': [dfCode['var1'].iloc[0]],
                       'var2': [dfCode['var2'].iloc[0]],
                       'averagePrice': [dfCode['averagePrice'].iloc[0]],
                       'totalSells': [dfCode['totalSells'].iloc[0]]}
        dfPart = pd.DataFrame(theData, columns=['code', 'var1', 'var2', 'averagePrice', 'totalSells'])
        listDF.append(dfPart)
    newDF = pd.concat(listDF)
    print(newDF)
I have this dataframe
code var1 var2 price sells
0 1 10 2 20 3
1 2 20 4 30 4
2 3 30 6 40 5
3 2 20 4 50 1
4 3 30 6 10 2
5 3 30 6 20 3
I want to generate the following dataframe:
code var1 var2 averagePrice totalSells
0 1 10 2 20.0 3
0 2 20 4 34.0 5
0 3 30 6 28.0 10
Note that this dataframe is created from the first one by evaluating the sells-weighted average price and the total sells for each code. Furthermore, var1 and var2 are the same for each code. The Python code above does that, but I know it is inefficient. I believe the desired result can be produced with groupby, but I have not been able to work it out.
It is a bit different here; use groupby.apply with pd.Series:
df.groupby(['code', 'var1', 'var2']).apply(
    lambda x: pd.Series({'averagePrice': sum(x['sells'] * x['price']) / sum(x['sells']),
                         'totalSells': sum(x['sells'])})).reset_index()
Out[366]:
code var1 var2 averagePrice totalSells
0 1 10 2 20.0 3.0
1 2 20 4 34.0 5.0
2 3 30 6 28.0 10.0
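If groupby.apply turns out to be slow on a larger frame, a vectorized variant along these lines should produce the same result (a sketch; weighted_summary is just an illustrative name, not part of the original code):
import pandas as pd

def weighted_summary(df):
    # Precompute price*sells once, then aggregate with plain groupby sums
    tmp = df.assign(weighted=df['price'] * df['sells'])
    out = tmp.groupby(['code', 'var1', 'var2'], as_index=False).agg(
        weighted=('weighted', 'sum'),
        totalSells=('sells', 'sum'),
    )
    out['averagePrice'] = out['weighted'] / out['totalSells']
    return out[['code', 'var1', 'var2', 'averagePrice', 'totalSells']]

# weighted_summary(creatingDataFrame()) should match the apply-based output above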

Delete rows of a pandas data frame having string values in python 3.4.1

I have read a csv file with pandas read_csv; it has 8 columns. Each column may contain int/string/float values. I want to remove the rows that contain string values and return a data frame with only numeric values in it. I am attaching a csv sample.
I have tried to run this following code:
import pandas as pd
import numpy as np
df = pd.read_csv('new200_with_errors.csv',dtype={'Geo_Level_1' : int,'Geo_Level_2' : int,'Geo_Level_3' : int,'Product_Level_1' : int,'Product_Level_2' : int,'Product_Level_3' : int,'Total_Sale' : float})
print(df)
but I get the following error:
TypeError: unorderable types: NoneType() > int()
I am running with python 3.4.1.
Here is the sample csv.
Geo_L_1,Geo_L_2,Geo_L_3,Pro_L_1,Pro_L_2,Pro_L_3,Date,Sale
1, 2, 3, 129, 1, 5193316745, 1/1/2012, 9
1 ,2, 3, 129, 1, 5193316745, 1/1/2013,
1, 2, 3, 129, 1, 5193316745, , 8
1, 2, 3, 129, NA, 5193316745, 1/10/2012, 10
1, 2, 3, 129, 1, 5193316745, 1/10/2013, 4
1, 2, 3, ghj, 1, 5193316745, 1/10/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/11/2012, 4
1, 2, 3, 129, 1, ghgj, 1/11/2013, 2
1, 2, 3, 129, 1, 5193316745, 1/11/2014, 6
1, 2, 3, 129, 1, 5193316745, 1/12/2012, ghgj
1, 2, 3, 129, 1, 5193316745, 1/12/2013, 5
The way I would approach this is to try to convert the columns to int using a user function with a try/except to handle the situation where the value cannot be coerced to an int; those values get set to NaN. Then drop the row where you have an empty date value. For some reason it actually has a length of 1 when I tested this with your data, so it may work for you using len 0.
In [42]:
# simple function to try to convert the type, returns NaN if the value cannot be coerced
def func(x):
    try:
        return int(x)
    except ValueError:
        return np.nan

# assign multiple columns
df['Pro_L_1'], df['Pro_L_3'], df['Sale'] = df['Pro_L_1'].apply(func), df['Pro_L_3'].apply(func), df['Sale'].apply(func)
# drop the 'empty' date row, take a copy() so we don't get a warning
df = df.loc[df['Date'].str.len() > 1].copy()
# convert the string to a datetime, if we didn't drop the row it would set the empty row to today's date
df['Date']= pd.to_datetime(df['Date'])
# now convert all the dtypes that are numeric to a numeric dtype
df = df.convert_objects(convert_numeric=True)
# check the dtypes
df.dtypes
Out[42]:
Geo_L_1 int64
Geo_L_2 int64
Geo_L_3 int64
Pro_L_1 float64
Pro_L_2 float64
Pro_L_3 float64
Date datetime64[ns]
Sale float64
dtype: object
In [43]:
# display the current situation
df
Out[43]:
Geo_L_1 Geo_L_2 Geo_L_3 Pro_L_1 Pro_L_2 Pro_L_3 Date Sale
0 1 2 3 129 1 5193316745 2012-01-01 9
1 1 2 3 129 1 5193316745 2013-01-01 NaN
3 1 2 3 129 NaN 5193316745 2012-01-10 10
4 1 2 3 129 1 5193316745 2013-01-10 4
5 1 2 3 NaN 1 5193316745 2014-01-10 6
6 1 2 3 129 1 5193316745 2012-01-11 4
7 1 2 3 129 1 NaN 2013-01-11 2
8 1 2 3 129 1 5193316745 2014-01-11 6
9 1 2 3 129 1 5193316745 2012-01-12 NaN
10 1 2 3 129 1 5193316745 2013-01-12 5
In [44]:
# drop the rows
df.dropna()
Out[44]:
Geo_L_1 Geo_L_2 Geo_L_3 Pro_L_1 Pro_L_2 Pro_L_3 Date Sale
0 1 2 3 129 1 5193316745 2012-01-01 9
4 1 2 3 129 1 5193316745 2013-01-10 4
6 1 2 3 129 1 5193316745 2012-01-11 4
8 1 2 3 129 1 5193316745 2014-01-11 6
10 1 2 3 129 1 5193316745 2013-01-12 5
For the last line, assign it back: df = df.dropna().
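As a side note for newer pandas versions (where convert_objects has since been removed), here is a sketch of the same idea using pd.to_numeric with errors='coerce', assuming the column names from the sample csv above:
import pandas as pd

# skipinitialspace strips the blanks that follow the commas in the sample csv
df = pd.read_csv('new200_with_errors.csv', skipinitialspace=True)

num_cols = ['Geo_L_1', 'Geo_L_2', 'Geo_L_3', 'Pro_L_1', 'Pro_L_2', 'Pro_L_3', 'Sale']

# Coerce anything non-numeric (e.g. 'ghj', 'ghgj') to NaN, parse the dates, then drop
df[num_cols] = df[num_cols].apply(pd.to_numeric, errors='coerce')
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df = df.dropna()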
