How to deal with 5 or 6 digit values for Latitude and Longitude? - python-3.x

I am trying to read in a dataframe and the latitude and longitude values don't seem accurate. This is not just for a few rows but for the entire dataframe of more than 100k rows.
[screenshot of the dataframe]
How do you handle such data?

It looks like your source could be using 99999 instead of NaN. I'd replace these with NaN (missing):
In [11]: df = pd.DataFrame([[1, 99999.0], [2, 4]], columns=['A', 'B'])
In [12]: df[['B']] = df[['B']].replace(99999., np.nan)
In [13]: df
Out[13]:
   A    B
0  1  NaN
1  2  4.0
i.e.
df[['Latitude', 'Longitude']] = df[['Latitude', 'Longitude']].replace(99999., np.nan)
Note: This might replace some geo locations that are legitimately 99999 but that's very unlikely!
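If the data comes from a CSV, the sentinel can also be handled at read time, and an out-of-range check catches anything else. A minimal sketch, assuming a hypothetical file name and that the columns are literally named Latitude and Longitude:
import numpy as np
import pandas as pd

# treat the 99999 sentinel as missing while reading (file name is hypothetical)
df = pd.read_csv('locations.csv',
                 na_values={'Latitude': [99999], 'Longitude': [99999]})

# belt and braces: real latitudes lie in [-90, 90], longitudes in [-180, 180]
bad = ~df['Latitude'].between(-90, 90) | ~df['Longitude'].between(-180, 180)
df.loc[bad, ['Latitude', 'Longitude']] = np.nan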

Related

Assign values to pandas column based on condition [duplicate]

I need to forward fill values in a column of a dataframe within groups. I should note that the first value in a group is never missing by construction. I have the following solutions at the moment.
df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, 2, np.nan, np.nan]})
# desired output
a b
1 1
1 1
2 2
2 2
2 2
Here are the three solutions that I've tried so far.
# really slow solutions
df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
df['b'] = df.groupby('a')['b'].fillna(method='ffill')
# much faster solution, but more memory intensive and ugly all around
tmp = df.drop_duplicates('a', keep='first')
df.drop('b', inplace=True, axis=1)
df = df.merge(tmp, on='a')
All three of these produce my desired output, but the first two take a really long time on my data set, and the third solution is more memory intensive and feels rather clunky. Are there any other ways to forward fill a column?
You need to sort by both columns, df.sort_values(['a', 'b']).ffill(), to ensure robustness. If an np.nan is left in the first position within a group, ffill will fill it with a value from the prior group. Because np.nan is placed at the end of any sort, sorting by both a and b ensures that you will not have np.nan at the front of any group. You can then .loc or .reindex with the initial index to get back your original order.
This will obviously be a tad slower than the other proposals... However, I contend it will be correct where the others are not.
demo
Consider the dataframe df
df = pd.DataFrame({'a': [1,1,2,2,2], 'b': [1, np.nan, np.nan, 2, np.nan]})
print(df)
   a    b
0  1  1.0
1  1  NaN
2  2  NaN
3  2  2.0
4  2  NaN
Try
df.sort_values('a').ffill()
   a    b
0  1  1.0
1  1  1.0
2  2  1.0  # <--- this is incorrect
3  2  2.0
4  2  2.0
Instead do
df.sort_values(['a', 'b']).ffill().loc[df.index]
   a    b
0  1  1.0
1  1  1.0
2  2  2.0
3  2  2.0
4  2  2.0
special note
This is still incorrect if an entire group has missing values.
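If that edge case matters, a group-aware fill is the safe fallback: it never bleeds values across groups, and an all-NaN group simply stays NaN. A sketch using the built-in groupby ffill, which also avoids the Python lambda of the first two solutions:
df['b'] = df.groupby('a')['b'].ffill()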
Using ffill() directly will give the best results; since the first value in each group is never missing (by construction in the question), plain ffill never pulls a value across a group boundary here. Here is the comparison:
%timeit df.b.ffill(inplace = True)
best of 3: 311 µs per loop
%timeit df['b'] = df.groupby('a')['b'].transform(lambda x: x.fillna(method='ffill'))
best of 3: 2.34 ms per loop
%timeit df['b'] = df.groupby('a')['b'].fillna(method='ffill')
best of 3: 4.41 ms per loop
What about this:
df.groupby('a').b.transform('ffill')
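A quick sketch of that one-liner applied to the question's frame, for completeness:
df = pd.DataFrame({'a': [1, 1, 2, 2, 2], 'b': [1, np.nan, 2, np.nan, np.nan]})
df['b'] = df.groupby('a').b.transform('ffill')
#    a    b
# 0  1  1.0
# 1  1  1.0
# 2  2  2.0
# 3  2  2.0
# 4  2  2.0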

Add Multiindex Dataframe and corresponding Series

I am failing to add a multiindex dataframe and a corresponding series. E.g.,
df = pd.DataFrame({
    'a': [0, 0, 1, 1], 'b': [0, 1, 0, 1],
    'c': [1, 2, 3, 4], 'd': [1, 1, 1, 1]}).set_index(['a', 'b'])
# Dataframe might contain records that are not in the series and vice versa
s = df['d'].iloc[1:]
df + s
produces
ValueError: cannot join with no overlapping index names
Does anyone know how to resolve this? I can work around the issue by adding each column separately, using e.g.
df['d'] + s
But I would like to add the two in a single operation. Any help is much appreciated.
By default, + tries to align along columns; for example, the following would work with +:
s = df.iloc[:, 1:]
df + s
#       c  d
# a b
# 0 0 NaN  2
#   1 NaN  2
# 1 0 NaN  2
#   1 NaN  2
In your case (adding the original s = df['d'].iloc[1:]), you need to align along the index. You can explicitly specify axis=0 with the add method for that:
df.add(s, axis=0)
#        c    d
# a b
# 0 0  NaN  NaN
#   1  3.0  2.0
# 1 0  4.0  2.0
#   1  5.0  2.0
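If the rows missing from s should count as zero rather than produce NaN, one option (a sketch, not part of the original answer) is to reindex the Series onto the frame's index first:
df.add(s.reindex(df.index, fill_value=0), axis=0)
#      c  d
# a b
# 0 0  1  1
#   1  3  2
# 1 0  4  2
#   1  5  2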

How to fill dataframe column as tuple of other columns values using np.where?

I have a dataframe as follows
id  Domain  City
1   DM      Pune
2   VS      Delhi
I want to create a new column which will contain a tuple of the column values id & Domain, e.g.:
id  Domain  City   New_Col
1   DM      Pune   (1,DM)
2   VS      Delhi  (2,VS)
I know I can create it easily using apply & lambda as follows:
df['New_Col'] = df.apply(lambda r: tuple(r[bkeys]), axis=1)  # here bkeys = ['id','Domain']
However, this takes a very long time for larger dataframes with > 100k records. Hence I want to use np.where like this:
df['New_Col'] = np.where(True, tuple(df[bkeys]), '')
But this doesn't work, it gives values like: ('id','Domain')
Any suggestions?
Try this:
df.assign(new_col = df[['id','Domain']].agg(tuple, axis=1))
Output:
   id Domain   City  new_col
0   1     DM   Pune  (1, DM)
1   2     VS  Delhi  (2, VS)
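If speed on a >100k-row frame is the main concern, a plain zip is usually much faster than any row-wise apply (a sketch, not benchmarked here):
df['New_Col'] = list(zip(df['id'], df['Domain']))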
Something or other is giving people a wrong idea of what np.where does. I've seen similar errors in other questions.
Let's make your dataframe:
In [2]: import pandas as pd
In [3]: df = pd.DataFrame([[1,'DM','Pune'],[2,'VS','Delhi']],columns=['id','Domain','City'])
In [4]: df
Out[4]:
   id Domain   City
0   1     DM   Pune
1   2     VS  Delhi
Your apply expression:
In [5]: bkeys = ['id','Domain']
In [6]: df.apply(lambda r:tuple(r[bkeys]),axis=1)
Out[6]:
0 (1, DM)
1 (2, VS)
dtype: object
What's happening here? apply is iterating over the rows of df; r is one row.
So the first row:
In [9]: df.iloc[0]
Out[9]:
id 1
Domain DM
City Pune
Name: 0, dtype: object
index with bkeys:
In [10]: df.iloc[0][bkeys]
Out[10]:
id 1
Domain DM
Name: 0, dtype: object
and make a tuple from that:
In [11]: tuple(df.iloc[0][bkeys])
Out[11]: (1, 'DM')
But what do we get when indexing the whole dataframe:
In [12]: df[bkeys]
Out[12]:
   id Domain
0   1     DM
1   2     VS
In [15]: tuple(df[bkeys])
Out[15]: ('id', 'Domain')
np.where is a function; it is not an iterator. The interpreter evaluates each of its arguments, and passes them to the function.
In [16]: np.where(True, tuple(df[bkeys]), '')
Out[16]: array(['id', 'Domain'], dtype='<U6')
This is what you tried to assign to the new column.
In [17]: df
Out[17]:
   id Domain   City New_Col
0   1     DM   Pune      id
1   2     VS  Delhi  Domain
This assignment only works because the tuple has 2 elements, and df has 2 rows. Otherwise you'd get an error.
np.where is not a magical way of speeding up a dataframe apply. It's a way of creating an array of values which, if it's the right size, can be assigned to a dataframe column (Series).
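For reference, a minimal sketch of what np.where is actually for: elementwise selection between two values based on a boolean condition over the whole column.
np.where(df['id'] > 1, 'several', 'single')
# -> array(['single', 'several'], ...)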
We could create a numpy array from the selected columns:
In [31]: df[bkeys].to_numpy()
Out[31]:
array([[1, 'DM'],
       [2, 'VS']], dtype=object)
and from that get a list of lists, and assign that to a new column:
In [32]: df[bkeys].to_numpy().tolist()
Out[32]: [[1, 'DM'], [2, 'VS']]
In [33]: df['New_Col'] = _
In [34]: df
Out[34]:
   id Domain   City  New_Col
0   1     DM   Pune  [1, DM]
1   2     VS  Delhi  [2, VS]
If you really want tuples, the sublists will have to be converted:
In [35]: [tuple(i) for i in df[bkeys].to_numpy().tolist()]
Out[35]: [(1, 'DM'), (2, 'VS')]
Another way of making a list of tuples (which works because array records convert to tuples):
In [42]: df[bkeys].to_records(index=False).tolist()
Out[42]: [(1, 'DM'), (2, 'VS')]
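A further option along the same lines (a sketch) is itertuples, which yields plain tuples directly when name=None:
list(df[bkeys].itertuples(index=False, name=None))
# [(1, 'DM'), (2, 'VS')]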

Manipulate values in pandas DataFrame columns based on matching IDs from another DataFrame

I have two dataframes like the following examples:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, 1],
                   'c': [np.nan, 1, 1]})
df_id = pd.DataFrame({'b': ['50', '4954', '93920', '20'],
                      'c': ['123', '100', '6', np.nan]})
print(df)
     a    b    c
0   20  1.0  NaN
1   50  NaN  1.0
2  100  1.0  1.0
print(df_id)
       b    c
0     50  123
1   4954  100
2  93920    6
3     20  NaN
For each identifier in df['a'], I want to null the value in df['b'] if there is no matching identifier in any row in df_id['b']. I want to do the same for column df['c'].
My desired result is as follows:
result = pd.DataFrame({'a': ['20', '50', '100'], 'b': [1, np.nan, np.nan],
                       'c': [np.nan, np.nan, 1]})
print(result)
     a    b    c
0   20  1.0  NaN
1   50  NaN  NaN   # df_id['c'] did not contain '50'
2  100  NaN  1.0   # df_id['b'] did not contain '100'
My attempt to do this is here:
for i, letter in enumerate(['b', 'c']):
    df[letter] = (df.apply(lambda x: x[letter] if x['a']
                  .isin(df_id[letter].tolist()) else np.nan, axis=1))
The error I get:
AttributeError: ("'str' object has no attribute 'isin'", 'occurred at index 0')
This is in Python 3.5.2, Pandas version 20.1
You can solve your problem using this instead:
for letter in ['b', 'c']:  # dropped enumerate since it isn't needed here; you may need it for the rest of your code
    df[letter] = df.apply(lambda row: row[letter] if row['a'] in df_id[letter].tolist() else np.nan, axis=1)
Just replace isin with in.
The problem is that when you use apply on df, x represents a row of df, so when you select x['a'] you're actually selecting a single element.
However, isin only applies to Series or other list-like structures, which is what raises the error, so instead we just use in to check whether that element is in the list.
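To make the scalar-versus-Series distinction concrete (a small sketch using the question's frames):
row = df.iloc[0]
row['a']                         # '20' -- a plain string, so it has no .isin method
row['a'] in df_id['b'].tolist()  # True -- membership test on a scalar
df['a'].isin(df_id['b'])         # the vectorised version over the whole column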
Hope that was helpful. If you have any questions please ask.
Adapting a hard-to-find answer from Pandas New Column Calculation Based on Existing Columns Values:
for i, letter in enumerate(['b', 'c']):
    mask = df['a'].isin(df_id[letter])
    name = letter + '_new'
    # for some reason, df[letter] = df.loc[mask, letter] does not work
    df.loc[mask, name] = df.loc[mask, letter]
    df[letter] = df[name]
    del df[name]
This isn't pretty, but seems to work.
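A slightly tidier variant of the same idea (a sketch using Series.where, which keeps a value where the condition holds and inserts NaN elsewhere):
for letter in ['b', 'c']:
    df[letter] = df[letter].where(df['a'].isin(df_id[letter]))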
If you have a bigger Dataframe and performance is important to you, you can first build a mask df and then apply it to your dataframe.
First create the mask:
mask = df_id.apply(lambda x: df['a'].isin(x))
       b      c
0   True  False
1   True  False
2  False   True
This can be applied to the original dataframe:
df.iloc[:,1:] = df.iloc[:,1:].mask(~mask, np.nan)
     a    b    c
0   20  1.0  NaN
1   50  NaN  NaN
2  100  NaN  1.0

Pandas Rows with missing values in multiple columns

I have a dataframe with columns age, date and location.
I would like to count how many rows are empty across ALL columns (not just some, but all at the same time). I have the following code; each line works independently, but how do I say age AND date AND location are all null?
df['age'].isnull().sum()
df['date'].isnull().sum()
df['location'].isnull().sum()
I would like to return a dataframe after removing the rows with missing values in ALL these three columns, so something like the following lines but combined in one statement:
df.mask(row['location'].isnull())
df[np.isfinite(df['age'])]
df[np.isfinite(df['date'])]
You can basically use your approach, just without indexing the individual columns:
df.isnull().sum().sum()
The first .sum() returns per-column counts, and the second .sum() adds those up into the total number of NaN values.
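If the goal is specifically rows where all three columns are null at once (rather than the total number of NaN cells), a small sketch using the question's column names:
df[['age', 'date', 'location']].isnull().all(axis=1).sum()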
Similar to Vaishali's answer, you can use df.dropna() to drop all rows that contain NaN or None values and return only the cleaned DataFrame.
In [45]: df = pd.DataFrame({'age': [1, 2, 3, np.NaN, 4, None], 'date': [1, 2, 3, 4, None, 5], 'location': ['a', 'b', 'c', None, 'e', 'f']})
In [46]: df
Out[46]:
   age  date location
0  1.0   1.0        a
1  2.0   2.0        b
2  3.0   3.0        c
3  NaN   4.0     None
4  4.0   NaN        e
5  NaN   5.0        f
In [47]: df.isnull().sum().sum()
Out[47]: 4
In [48]: df.dropna()
Out[48]:
   age  date location
0  1.0   1.0        a
1  2.0   2.0        b
2  3.0   3.0        c
You can find the number of rows where all values are NaN with
len(df) - len(df.dropna(how = 'all'))
and drop by
df = df.dropna(how = 'all')
This will drop the rows in which every value is NaN.
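To restrict the check to just those three columns (a sketch assuming the column names from the question), dropna's subset parameter can be combined with how='all':
df = df.dropna(how='all', subset=['age', 'date', 'location'])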
