Assign values to pandas column based on condition [duplicate] - python-3.x

I need to forward fill values in a column of a dataframe within groups. I should note that the first value in a group is never missing by construction. I have the following solutions at the moment.
df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, 2, np.nan, np.nan]})
# desired output
a b
1 1
1 1
2 2
2 2
2 2
Here are the three solutions that I've tried so far.
# really slow solutions
df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
df['b'] = df.groupby('a')['b'].fillna(method='ffill')
# much faster solution, but more memory intensive and ugly all around
tmp = df.drop_duplicates('a', keep='first')
df.drop('b', inplace=True, axis=1)
df = df.merge(tmp, on='a')
All three of these produce my desired output, but the first two take a really long time on my data set, and the third solution is more memory intensive and feels rather clunky. Are there any other ways to forward fill a column?

You need to sort by both columns df.sort_values(['a', 'b']).ffill() to ensure robustness. If an np.nan is left in the first position within a group, ffill will fill that with a value from the prior group. Because np.nan will be placed at the end of any sort, sorting by both a and b ensures that you will not have np.nan at the front of any group. You can then .loc or .reindex with the initial index to get back your original order.
This will obviously be a tad slower than the other proposals... However, I contend it will be correct where the others are not.
demo
Consider the dataframe df
df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, np.nan, 2, np.nan]})
print(df)
a b
0 1 1.0
1 1 NaN
2 2 NaN
3 2 2.0
4 2 NaN
Try
df.sort_values('a').ffill()
a b
0 1 1.0
1 1 1.0
2 2 1.0 # <--- this is incorrect
3 2 2.0
4 2 2.0
Instead do
df.sort_values(['a', 'b']).ffill().loc[df.index]
a b
0 1 1.0
1 1 1.0
2 2 2.0
3 2 2.0
4 2 2.0
special note
This is still incorrect if an entire group has missing values

Using ffill() directly is the fastest option, and it is safe here because the first value in each group is never missing. Here is the comparison
%timeit df.b.ffill(inplace = True)
best of 3: 311 µs per loop
%timeit df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
best of 3: 2.34 ms per loop
%timeit df['b'] = df.groupby('a')['b'].fillna(method='ffill')
best of 3: 4.41 ms per loop

what about this
df.groupby('a').b.transform('ffill')
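To make the difference between a plain ffill and a grouped one concrete, here is a minimal sketch using the demo frame from the earlier answer, where group 2 starts with a missing value. The variable names plain and grouped are only illustrative. In recent pandas releases fillna(method='ffill') is deprecated, so the sketch calls ffill() directly.

```python
import numpy as np
import pandas as pd

# Group 2 starts with NaN, so a plain ffill leaks the value from group 1.
df = pd.DataFrame({"a": [1, 1, 2, 2, 2],
                   "b": [1, np.nan, np.nan, 2, np.nan]})

plain = df["b"].ffill()                  # ignores group boundaries
grouped = df.groupby("a")["b"].ffill()   # fills only within each group

print(pd.DataFrame({"a": df["a"], "plain": plain, "grouped": grouped}))
```

The grouped version leaves the leading NaN of group 2 untouched instead of borrowing the value from group 1.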

Related

Add Multiindex Dataframe and corresponding Series

I am failing to add a multiindex dataframe and a corresponding series. E.g.,
df = pd.DataFrame({
'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1],
'c': [1, 2, 3, 4], 'd':[1, 1, 1, 1]}).set_index(['a', 'b'])
# Dataframe might contain records that are not in the series and vice versa
s = df['d'].iloc[1:]
df + s
produces
ValueError: cannot join with no overlapping index names
Does anyone know how to resolve this? I can work around the issue by adding each column separately, using e.g.
df['d'] + s
But I would like to add the two in a single operation. Any help is much appreciated.
By default, + tries to align along columns, so the following would work with +:
s = df.iloc[:, 1:]
df + s
# c d
#a b
#0 0 NaN 2
# 1 NaN 2
#1 0 NaN 2
# 1 NaN 2
In your case, you need to align along the index. You can explicitly specify axis=0 with the add method for that:
df.add(s, axis=0)
# c d
#a b
#0 0 NaN NaN
# 1 3.0 2.0
#1 0 4.0 2.0
# 1 5.0 2.0

Flatten json column in Python [duplicate]

I have data saved in a postgreSQL database. I am querying this data using Python2.7 and turning it into a Pandas DataFrame. However, the last column of this dataframe has a dictionary of values inside it. The DataFrame df looks like this:
Station ID Pollutants
8809 {"a": "46", "b": "3", "c": "12"}
8810 {"a": "36", "b": "5", "c": "8"}
8811 {"b": "2", "c": "7"}
8812 {"c": "11"}
8813 {"a": "82", "c": "15"}
I need to split this column into separate columns, so that the DataFrame df2 looks like this:
Station ID a b c
8809 46 3 12
8810 36 5 8
8811 NaN 2 7
8812 NaN NaN 11
8813 82 NaN 15
The major issue I'm having is that the lists are not the same lengths. But all of the lists only contain up to the same 3 values: 'a', 'b', and 'c'. And they always appear in the same order ('a' first, 'b' second, 'c' third).
The following code USED to work and return exactly what I wanted (df2).
objs = [df, pandas.DataFrame(df['Pollutant Levels'].tolist()).iloc[:, :3]]
df2 = pandas.concat(objs, axis=1).drop('Pollutant Levels', axis=1)
print(df2)
I was running this code just last week and it was working fine. But now my code is broken and I get this error from line [4]:
IndexError: out-of-bounds on slice (end)
I made no changes to the code but am now getting the error. I feel this is due to my method not being robust or proper.
Any suggestions or guidance on how to split this column of lists into separate columns would be super appreciated!
EDIT: I think the .tolist() and .apply methods are not working on my code because it is one Unicode string, i.e.:
#My data format
u{'a': '1', 'b': '2', 'c': '3'}
#and not
{u'a': '1', u'b': '2', u'c': '3'}
The data is imported from the postgreSQL database in this format. Any help or ideas with this issue? Is there a way to convert the Unicode?
To convert the string to an actual dict, you can do df['Pollutant Levels'].map(eval). Afterwards, the solution below can be used to convert the dict to different columns.
Using a small example, you can use .apply(pd.Series):
In [2]: df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})
In [3]: df
Out[3]:
a b
0 1 {u'c': 1}
1 2 {u'd': 3}
2 3 {u'c': 5, u'd': 6}
In [4]: df['b'].apply(pd.Series)
Out[4]:
c d
0 1.0 NaN
1 NaN 3.0
2 5.0 6.0
To combine it with the rest of the dataframe, you can concat the other columns with the above result:
In [7]: pd.concat([df.drop(['b'], axis=1), df['b'].apply(pd.Series)], axis=1)
Out[7]:
a c d
0 1 1.0 NaN
1 2 NaN 3.0
2 3 5.0 6.0
Using your code, this also works if I leave out the iloc part:
In [15]: pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
Out[15]:
a c d
0 1 1.0 NaN
1 2 NaN 3.0
2 3 5.0 6.0
I know the question is quite old, but I got here searching for answers. There is actually a better (and faster) way now of doing this using json_normalize:
import pandas as pd
df2 = pd.json_normalize(df['Pollutant Levels'])
This avoids costly apply functions...
The fastest method to normalize a column of flat, one-level dicts, as per the timing analysis performed by Shijith in this answer:
df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))
It will not resolve other issues, with columns of list or dicts, that are addressed below, such as rows with NaN, or nested dicts.
pd.json_normalize(df.Pollutants) is significantly faster than df.Pollutants.apply(pd.Series)
See the %%timeit below. With the timings shown there (46.9 ms versus 7.75 s), .json_normalize is roughly 165 times faster than .apply.
Whether reading data from a file, or from an object returned by a database, or API, it may not be clear if the dict column has dict or str type.
If the dictionaries in the column are str type, they must be converted back to a dict type, using ast.literal_eval, or json.loads(…).
Use pd.json_normalize to convert the dicts, with keys as headers and values for rows.
There are additional parameters (e.g. record_path & meta) for dealing with nested dicts.
Use pandas.DataFrame.join to combine the original DataFrame, df, with the columns created using pd.json_normalize
If the index isn't integers (as in the example), first use df.reset_index() to get an index of integers, before doing the normalize and join.
pandas.DataFrame.pop is used to remove the specified column from the existing dataframe. This removes the need to drop the column later, using pandas.DataFrame.drop.
As a note, if the column has any NaN, they must be filled with an empty dict
df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index})
If the 'Pollutants' column is strings, use '{}'.
Also see How to json_normalize a column with NaNs.
import pandas as pd
from ast import literal_eval
import numpy as np
data = {'Station ID': [8809, 8810, 8811, 8812, 8813, 8814],
'Pollutants': ['{"a": "46", "b": "3", "c": "12"}', '{"a": "36", "b": "5", "c": "8"}', '{"b": "2", "c": "7"}', '{"c": "11"}', '{"a": "82", "c": "15"}', np.nan]}
df = pd.DataFrame(data)
# display(df)
Station ID Pollutants
0 8809 {"a": "46", "b": "3", "c": "12"}
1 8810 {"a": "36", "b": "5", "c": "8"}
2 8811 {"b": "2", "c": "7"}
3 8812 {"c": "11"}
4 8813 {"a": "82", "c": "15"}
5 8814 NaN
# check the type of the first value in Pollutants
>>> print(type(df.iloc[0, 1]))
<class 'str'>
# replace NaN with '{}' if the column is strings, otherwise replace with {}
df.Pollutants = df.Pollutants.fillna('{}') # if the NaN is in a column of strings
# df.Pollutants = df.Pollutants.fillna({i: {} for i in df.index}) # if the column is not strings
# Convert the column of stringified dicts to dicts
# skip this line, if the column contains dicts
df.Pollutants = df.Pollutants.apply(literal_eval)
# reset the index if the index is not unique integers from 0 to n-1
# df.reset_index(inplace=True) # uncomment if needed
# remove and normalize the column of dictionaries, and join the result to df
df = df.join(pd.json_normalize(df.pop('Pollutants')))
# display(df)
Station ID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
5 8814 NaN NaN NaN
%%timeit
# dataframe with 1M rows
dfb = pd.concat([df]*20000).reset_index(drop=True)
%%timeit
dfb.join(pd.json_normalize(dfb.Pollutants))
[out]:
46.9 ms ± 201 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
pd.concat([dfb.drop(columns=['Pollutants']), dfb.Pollutants.apply(pd.Series)], axis=1)
[out]:
7.75 s ± 52.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Try this: the data returned from SQL has to be converted into a dict.
Or could it be that 'Pollutant Levels' is now 'Pollutants'?
StationID Pollutants
0 8809 {"a":"46","b":"3","c":"12"}
1 8810 {"a":"36","b":"5","c":"8"}
2 8811 {"b":"2","c":"7"}
3 8812 {"c":"11"}
4 8813 {"a":"82","c":"15"}
df2["Pollutants"] = df2["Pollutants"].apply(lambda x : dict(eval(x)) )
df3 = df2["Pollutants"].apply(pd.Series )
a b c
0 46 3 12
1 36 5 8
2 NaN 2 7
3 NaN NaN 11
4 82 NaN 15
result = pd.concat([df2, df3], axis=1).drop('Pollutants', axis=1)
result
StationID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
I strongly recommend this method to extract the column 'Pollutants':
df_pollutants = pd.DataFrame(df['Pollutants'].values.tolist(), index=df.index)
it's much faster than
df_pollutants = df['Pollutants'].apply(pd.Series)
when the size of df is giant.
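Passing index=df.index matters when the frame does not have a default 0..n-1 index, because join aligns on index labels. The following is a small sketch under that assumption; the index labels 10, 20, 30 and the names expanded and result are made up for illustration.

```python
import pandas as pd

# A frame whose index is NOT the default 0..n-1 range.
df = pd.DataFrame(
    {"Station ID": [8809, 8810, 8811],
     "Pollutants": [{"a": "46", "b": "3"}, {"b": "5"}, {"c": "7"}]},
    index=[10, 20, 30],
)

# Reuse the original index so the expanded columns line up row by row.
expanded = pd.DataFrame(df["Pollutants"].tolist(), index=df.index)
result = df.drop(columns="Pollutants").join(expanded)
print(result)
```

Without index=df.index the expanded frame would be labelled 0, 1, 2 and the join would produce only NaN for the new columns.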
Merlin's answer is better and super easy, but we don't need a lambda function. The dictionary evaluation step can be skipped in either of the two ways illustrated below:
Way 1: Two steps
# step 1: expand the `Pollutants` column into a DataFrame, one column per key
df_pol_ps = df['Pollutants'].apply(pd.Series)
df_pol_ps:
a b c
0 46 3 12
1 36 5 8
2 NaN 2 7
3 NaN NaN 11
4 82 NaN 15
# step 2: concat columns `a, b, c` and drop/remove the `Pollutants`
df_final = pd.concat([df, df_pol_ps], axis = 1).drop('Pollutants', axis = 1)
df_final:
StationID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
Way 2: The above two steps can be combined in one go:
df_final = pd.concat([df, df['Pollutants'].apply(pd.Series)], axis = 1).drop('Pollutants', axis = 1)
df_final:
StationID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
Note : for dictionary with depth=1 (one-level)
>>> df
Station ID Pollutants
0 8809 {"a": "46", "b": "3", "c": "12"}
1 8810 {"a": "36", "b": "5", "c": "8"}
2 8811 {"b": "2", "c": "7"}
3 8812 {"c": "11"}
4 8813 {"a": "82", "c": "15"}
speed comparison for a large dataset of 10 million rows
>>> df = pd.concat([df]*2000000).reset_index(drop=True)
>>> print(df.shape)
(10000000, 2)
def apply_drop(df):
    return df.join(df['Pollutants'].apply(pd.Series)).drop('Pollutants', axis=1)
def json_normalise_drop(df):
    return df.join(pd.json_normalize(df.Pollutants)).drop('Pollutants', axis=1)
def tolist_drop(df):
    return df.join(pd.DataFrame(df['Pollutants'].tolist())).drop('Pollutants', axis=1)
def vlues_tolist_drop(df):
    return df.join(pd.DataFrame(df['Pollutants'].values.tolist())).drop('Pollutants', axis=1)
def pop_tolist(df):
    return df.join(pd.DataFrame(df.pop('Pollutants').tolist()))
def pop_values_tolist(df):
    return df.join(pd.DataFrame(df.pop('Pollutants').values.tolist()))
>>> %timeit apply_drop(df.copy())
1 loop, best of 3: 53min 20s per loop
>>> %timeit json_normalise_drop(df.copy())
1 loop, best of 3: 54.9 s per loop
>>> %timeit tolist_drop(df.copy())
1 loop, best of 3: 6.62 s per loop
>>> %timeit vlues_tolist_drop(df.copy())
1 loop, best of 3: 6.63 s per loop
>>> %timeit pop_tolist(df.copy())
1 loop, best of 3: 5.99 s per loop
>>> %timeit pop_values_tolist(df.copy())
1 loop, best of 3: 5.94 s per loop
+---------------------+-----------+
| apply_drop | 53min 20s |
| json_normalise_drop | 54.9 s |
| tolist_drop | 6.62 s |
| vlues_tolist_drop | 6.63 s |
| pop_tolist | 5.99 s |
| pop_values_tolist | 5.94 s |
+---------------------+-----------+
df.join(pd.DataFrame(df.pop('Pollutants').values.tolist())) is the fastest
How do I split a column of dictionaries into separate columns with pandas?
pd.DataFrame(df['val'].tolist()) is the canonical method for exploding a column of dictionaries
Here's your proof using a colorful graph.
Benchmarking code for reference.
Note that I am only timing the explosion since that's the most interesting part of answering this question - other aspects of result construction (such as whether to use pop or drop) are tangential to the discussion and can be ignored (it should be noted however that using pop avoids the followup drop call, so the final solution is a bit more performant, but we are still listifying the column and passing it to pd.DataFrame either way).
Additionally, pop destructively mutates the input DataFrame, making it harder to run in benchmarking code which assumes the input is not changed across test runs.
Critique of other solutions
df['val'].apply(pd.Series) is extremely slow for large N as pandas constructs Series objects for each row, then proceeds to construct a DataFrame from them. For larger N the performance dips to the order of minutes or hours.
pd.json_normalize(df['val']) is slower simply because json_normalize is meant to work with much more complex input data, particularly deeply nested JSON with multiple record paths and metadata. We have a simple flat dict for which pd.DataFrame suffices, so use that if your dicts are flat.
Some answers suggest df.pop('val').values.tolist() or df.pop('val').to_numpy().tolist(). I don't think it makes much of a difference whether you listify the series or the numpy array. It's one operation less to listify the series directly and really isn't slower so I'd recommend avoiding generating the numpy array in the intermediate step.
You can use join with pop + tolist. Performance is comparable to concat with drop + tolist, but some may find this syntax cleaner:
res = df.join(pd.DataFrame(df.pop('b').tolist()))
Benchmarking with other methods:
df = pd.DataFrame({'a':[1,2,3], 'b':[{'c':1}, {'d':3}, {'c':5, 'd':6}]})
def joris1(df):
    return pd.concat([df.drop('b', axis=1), df['b'].apply(pd.Series)], axis=1)
def joris2(df):
    return pd.concat([df.drop('b', axis=1), pd.DataFrame(df['b'].tolist())], axis=1)
def jpp(df):
    return df.join(pd.DataFrame(df.pop('b').tolist()))
df = pd.concat([df]*1000, ignore_index=True)
%timeit joris1(df.copy()) # 1.33 s per loop
%timeit joris2(df.copy()) # 7.42 ms per loop
%timeit jpp(df.copy()) # 7.68 ms per loop
A one-line solution is the following:
>>> df = pd.concat([df['Station ID'], df['Pollutants'].apply(pd.Series)], axis=1)
>>> print(df)
Station ID a b c
0 8809 46 3 12
1 8810 36 5 8
2 8811 NaN 2 7
3 8812 NaN NaN 11
4 8813 82 NaN 15
df = pd.concat([df['a'], df.b.apply(pd.Series)], axis=1)
I've concatenated those steps in a method, you have to pass only the dataframe and the column which contains the dict to expand:
import json
from typing import Dict

import pandas as pd


def expand_dataframe(dw: pd.DataFrame, column_to_expand: str) -> pd.DataFrame:
    """
    dw: DataFrame with a column that contains a dict (stored as a string)
        to expand into columns
    column_to_expand: String with column name of dw
    """
    def convert_to_dict(sequence: str) -> Dict:
        # Swap single quotes for double quotes so json.loads accepts the string
        json_acceptable_string = sequence.replace("'", "\"")
        return json.loads(json_acceptable_string)

    expanded_dataframe = pd.concat([dw.drop([column_to_expand], axis=1),
                                    dw[column_to_expand]
                                    .apply(convert_to_dict)
                                    .apply(pd.Series)],
                                   axis=1)
    return expanded_dataframe
my_df = pd.DataFrame.from_dict(my_dict, orient='index', columns=['my_col'])
.. would have parsed the dict properly (putting each dict key into a separate df column, and key values into df rows), so the dicts would not get squashed into a single column in the first place.

Loop over columns with df.shift in Python

Let's say you have a dataframe like this:
df = pd.DataFrame({'A': [3, 1, 2, 3],
'B': [5, 6, 7, 8]})
df
A B
0 3 5
1 1 6
2 2 7
3 3 8
Now I want to skew and calculate on each column. I put the values as I want them skewed in the index:
range_span = range(4)
result = pd.DataFrame(index=range_span)
Then I try to populate result with the following:
for c in df.columns:
    for i in range_span:
        result.iloc[i][c] = df[c].shift(i).max()
result
This only returns the index. I expected something like this:
You've got 3 critical issues:
issue #1
At this line
result.iloc[i][c] = df[c].shift(i).max()
pandas raises a warning that helps explain why result is empty.
...\pandas\core\indexing.py:670: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
According to the documentation:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
Because result.iloc[i] returns a copy of that row rather than a view, the assignment goes to the copy and never reaches the original dataframe result. This is also why iloc did not raise an error when it got a str index, as explained in issue #2.
Instead, index rows and columns in a single iloc call (or loc, if you use str labels), like this:
>>> df
A B C
0 1 10 100
1 2 20 200
2 3 30 300
>>> df.iloc[1, 2]
200
>>> df.iloc[[1, 2], [1, 2]]
B C
1 20 200
2 30 300
>>> df.iloc[1:3, 1:3]
B C
1 20 200
2 30 300
>>> df.iloc[:, 1:3]
B C
0 10 100
1 20 200
2 30 300
# ..and so on
issue #2
If you fix issue #1, then you'll see the following error:
result.iloc[[i][c]] = df[c].shift(i).max()
TypeError: list indices must be integers or slices, not str
Also from the documentation:
property DataFrame.iloc: Purely integer-location based indexing for selection by position.
In for c in df.columns: you are passing the column names A and B, which are str, not int. Use loc instead for str column labels.
This didn't raise a TypeError because of issue #1: c was passed as the argument of __setitem__().
Issue #3
Normally, iloc cannot enlarge a dataframe; that takes special functions like combine.
# using same df from #1
>>> df.iloc[1, 3] = 300
Traceback (most recent call last):
File "~\pandas\core\indexing.py", line 1394, in _has_valid_setitem_indexer
raise IndexError("iloc cannot enlarge its target object")
IndexError: iloc cannot enlarge its target object
An easier fix is to use a dict and convert it to a DataFrame once the manipulation is complete, or to create the DataFrame with a matching or larger size up front:
>>> df2 = pd.DataFrame(index=range(4), columns=range(3))
>>> df2
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
Combining all of this, a correct fix would be:
import pandas as pd
df = pd.DataFrame({'A': [3, 1, 2, 3],
'B': [5, 6, 7, 8]})
result = pd.DataFrame(index=df.index, columns=df.columns)
for col in df.columns:
    for index in df.index:
        result.loc[index, col] = df[col].shift(index).max()
print(result)
Output:
A B
0 3 8
1 3 7
2 3 6
3 3 5
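The dict-based alternative mentioned above could look roughly like the sketch below. It collects the shifted maxima per column in a dict comprehension and builds the DataFrame once at the end, so no element-wise assignment is needed. The name shifts is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [3, 1, 2, 3],
                   "B": [5, 6, 7, 8]})

shifts = range(len(df))

# One list of maxima per column, one entry per shift amount.
result = pd.DataFrame(
    {col: [df[col].shift(i).max() for i in shifts] for col in df.columns},
    index=shifts,
)
print(result)
```

Because the frame is built in one step, the result columns get a numeric dtype rather than the object dtype of a pre-allocated empty frame.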

How to replace selected rows of pandas dataframe with a np array, sequentially?

I have a pandas dataframe
A B C
0 NaN 2 6
1 3.0 4 0
2 NaN 0 4
3 NaN 1 2
where I have a column A that has NaN values in some rows (not necessarily consecutive).
I want to replace these values not with a constant value (which pd.fillna does), but rather with the values from a numpy array.
So the desired outcome is:
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
I'm not sure the .replace method will help here as well, since that seems to replace value <-> value via dictionary. Whereas here I want to sequentially change NaN to its corresponding value (by index) in the np array.
I tried:
MWE:
huh = pd.DataFrame([[np.nan, 2, 6],
[3, 4, 0],
[np.nan, 0, 4],
[np.nan, 1, 2]],
columns=list('ABC'))
huh.A[huh.A.isnull()] = np.array([1,5,7]) # what i want to do, but this gives error
gives the error
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I read the docs but I can't understand how to do this with .loc.
How do I do this properly, preferably without a for loop?
Other info:
The number of elements in the np array will always match the number of NaN in the dataframe, so your answer does not need to check for this.
You are really close; you need DataFrame.loc to avoid chained assignment:
huh.loc[huh.A.isnull(), 'A'] = np.array([1,5,7])
print (huh)
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
zip
This should account for uneven lengths
m = huh.A.isna()
a = np.array([1, 5, 7])
s = pd.Series(dict(zip(huh.index[m], a)))
huh.fillna({'A': s})
A B C
0 1.0 2 6
1 3.0 4 0
2 5.0 0 4
3 7.0 1 2
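To illustrate what "uneven lengths" means here, the sketch below feeds an array that is deliberately one element short. It is a variation on the snippet above, not the same call: it uses Series.fillna on column A, and the names short and filled are made up for the example. zip stops at the shorter input, so the last NaN simply stays in place.

```python
import numpy as np
import pandas as pd

huh = pd.DataFrame([[np.nan, 2, 6],
                    [3, 4, 0],
                    [np.nan, 0, 4],
                    [np.nan, 1, 2]],
                   columns=list("ABC"))

short = np.array([1, 5])          # only two values for three NaNs
m = huh["A"].isna()

# Pair NaN positions with replacement values; zip truncates to the shorter one.
s = pd.Series(dict(zip(huh.index[m], short)))
filled = huh.assign(A=huh["A"].fillna(s))
print(filled)
```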

Pandas Rows with missing values in multiple columns

I have a dataframe with columns age, date and location.
I would like to count how many rows are empty across ALL columns (not some but all in the same time). I have the following code, each line works independently, but how do I say age AND date AND location isnull?
df['age'].isnull().sum()
df['date'].isnull().sum()
df['location'].isnull().sum()
I would like to return a dataframe after removing the rows with missing values in ALL these three columns, so something like the following lines but combined in one statement:
df.mask(row['location'].isnull())
df[np.isfinite(df['age'])]
df[np.isfinite(df['date'])]
You basically can use your approach, but drop the column indices:
df.isnull().sum().sum()
The first .sum() returns a per-column value, while the second .sum() will return the sum of all NaN values.
Similar to Vaishali's answer, you can use df.dropna() to drop every row that contains a NaN or None and return only the cleaned DataFrame.
In [45]: df = pd.DataFrame({'age': [1, 2, 3, np.NaN, 4, None], 'date': [1, 2, 3, 4, None, 5], 'location': ['a', 'b', 'c', None, 'e', 'f']})
In [46]: df
Out[46]:
age date location
0 1.0 1.0 a
1 2.0 2.0 b
2 3.0 3.0 c
3 NaN 4.0 None
4 4.0 NaN e
5 NaN 5.0 f
In [47]: df.isnull().sum().sum()
Out[47]: 4
In [48]: df.dropna()
Out[48]:
age date location
0 1.0 1.0 a
1 2.0 2.0 b
2 3.0 3.0 c
You can find the number of rows in which all values are NaN with
len(df) - len(df.dropna(how = 'all'))
and drop by
df = df.dropna(how = 'all')
This will drop only the rows in which every value is NaN
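If the dataframe has more columns than the three of interest, the same idea can be restricted to a subset. The sketch below uses made-up sample data and the illustrative names cols, all_missing and cleaned; it counts the rows where age AND date AND location are all missing, then drops exactly those rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [1, np.nan, 3, np.nan],
    "date": [1, np.nan, np.nan, 4],
    "location": ["a", None, "c", None],
})
cols = ["age", "date", "location"]

# True only where every one of the three columns is missing in that row.
all_missing = df[cols].isnull().all(axis=1)
n_all_missing = int(all_missing.sum())

# Drop only the rows that are missing in all three columns.
cleaned = df.dropna(how="all", subset=cols)
print(n_all_missing)
print(cleaned)
```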

Resources