SUOD Model gives ValueError: Input Contains NaN - python-3.x

I am running SUOD from pyod, which is an ensemble method, and received the error below.
The models that I am running are IForest, COPOD and ECOD.
Running these models individually does not report any NaN values in the data. I have also already verified that none of the columns contains NaN values. The data is one-hot encoded.
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 1.0min remaining: 0.0s
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 1.0min finished
[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 5.8s remaining: 0.0s
[Parallel(n_jobs=2)]: Done 2 out of 2 | elapsed: 5.8s finished
Traceback (most recent call last):
File "ensemble.py", line 76, in <module>
clf.fit(x_train_scaled)
File "/home/ubuntu/thesis/lib/python3.8/site-packages/pyod/models/suod.py", line 220, in fit
decision_score_mat, self.score_scalar_ = standardizer(
File "/home/ubuntu/thesis/lib/python3.8/site-packages/pyod/utils/utility.py", line 152, in standardizer
X = check_array(X)
File "/home/ubuntu/thesis/lib/python3.8/site-packages/sklearn/utils/validation.py", line 919, in check_array
_assert_all_finite(
File "/home/ubuntu/thesis/lib/python3.8/site-packages/sklearn/utils/validation.py", line 161, in _assert_all_finite
raise ValueError(msg_err)
ValueError: Input contains NaN.
This is my code:
train_data.dropna(axis=0)
test1_data.dropna(axis=0)
test2_data.dropna(axis=0)

mm_scaler = MinMaxScaler()
x_train_scaled = mm_scaler.fit_transform(train_data)
x_test2_scaled = mm_scaler.transform(test2_data)
x_test1_scaled = mm_scaler.transform(test1_data)

detector_list = [COPOD(),
                 IForest(n_estimators=100, max_samples=10000, max_features=10,
                         bootstrap=True, n_jobs=-1, random_state=42),
                 IForest(n_estimators=200, max_samples=10000, max_features=10,
                         bootstrap=True, n_jobs=-1, random_state=42),
                 ECOD(contamination=0.001)]

clf = SUOD(base_estimators=detector_list, n_jobs=2, combination='average',
           verbose=False)
clf.fit(x_train_scaled)
train_pred = clf.predict(x_train_scaled)
test_pred1 = clf.predict(x_test1_scaled)
test_pred2 = clf.predict(x_test2_scaled)
Things that I have tried:
SimpleImputer
dropping NaN rows
adding the mock patch

As the error output says, you need to handle the NaN values. The dropna method returns a new dataframe; it does not modify the original. If you want to modify it in place, set the inplace parameter to True (the call then returns None):
inplace : boolean, default False
So, to modify it in place: data.dropna(axis=0, how='any', inplace=True)
Another possible way to handle NaN values (optional, and only if it is applicable to your problem or a data-mining context) is to fill them with a column statistic such as the mean: df = df.fillna(df.mean()).
Another, less common case is a dataframe that contains NaN values represented as the string "NaN"; the usual NaN-handling functions won't work on those. In that case you need something like df.replace("NaN", numpy.nan) before dropping.
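Putting that together, a minimal sketch of the corrected preprocessing, assuming train_data, test1_data and test2_data are the DataFrames from the question (reassignment is used here instead of inplace=True; either works):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# dropna returns a new DataFrame, so the result must be assigned back
train_data = train_data.dropna(axis=0)
test1_data = test1_data.dropna(axis=0)
test2_data = test2_data.dropna(axis=0)

mm_scaler = MinMaxScaler()
x_train_scaled = mm_scaler.fit_transform(train_data)
x_test1_scaled = mm_scaler.transform(test1_data)
x_test2_scaled = mm_scaler.transform(test2_data)

# sanity check before fitting SUOD: everything should be finite now
assert np.isfinite(x_train_scaled).all()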

Related

Do Polars dataframes need to be formatted a certain way to be used with scikit-learn models?

I've been working through a notebook on Kaggle that used Pandas, and I wanted to refactor it to Polars.
The dataframes I'm working with look like this:
X_train (shape: (891, 9)):

Pclass  Sex    Age    Fare   Embarked  title  family_size  is_alone  age*class
(i64)   (i16)  (i32)  (i16)  (i16)     (i16)  (i64)        (i64)     (i64)
3       0      1      0      1         1      2            1         0
...     ...    ...    ...    ...       ...    ...          ...       ...

y_train (shape: (891, 1)):

Survived (i64)
0
1
...
Kaggle has me creating a logistic regression with the following code:
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
It's my understanding that Polars dataframes don't work with sklearn, so I modified X_train and y_train to be numpy ndarrays like so:
logreg.fit(X_train.to_numpy(),y_train.to_numpy())
but then I get the following error:
>>> logreg.fit(X_train.to_numpy(),y_train.to_numpy())
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/utils/validation.py:993: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
LogisticRegression()
I was going to try saying y_train.transpose(), but there's no mention of needing to reshape the Pandas dataframe so I'm wondering if a transpose is really necessary, or if I'm doing something else wrong.
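For reference, the warning itself points at the fix: sklearn wants y as a 1-D array of shape (n_samples,) rather than a single-column 2-D array. A minimal sketch, using the frames shown above:

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
# to_numpy() on a one-column frame gives shape (n_samples, 1);
# ravel() flattens it to the (n_samples,) shape sklearn expects
logreg.fit(X_train.to_numpy(), y_train.to_numpy().ravel())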
Edit:
I added the numpy.ravel() function like this logreg.fit(X_train.to_numpy(),y_train.to_numpy().ravel()) and it seems to have worked, but now when I say
y_pred = logreg.predict(X_test.to_numpy()) I get the following error
>>>y_pred= logreg.predict(X_test.to_numpy())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/linear_model/_base.py", line 425, in predict
scores = self.decision_function(X)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/linear_model/_base.py", line 407, in decision_function
X = self._validate_data(X, accept_sparse="csr", reset=False)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/base.py", line 566, in _validate_data
X = check_array(X, **check_params)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/utils/validation.py", line 800, in check_array
_assert_all_finite(array, allow_nan=force_all_finite == "allow-nan")
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/sklearn/utils/validation.py", line 114, in _assert_all_finite
raise ValueError(
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
My X_test df is the same as X_train with the addition of a Survived feature.
Survived  Pclass  Sex    Age    Fare   Embarked  title  family_size  is_alone  age*class
(i64)     (i64)   (i16)  (i32)  (i16)  (i16)     (i16)  (i64)        (i64)     (i64)
null      3       0      1      0      1         1      2            1         0
...       ...     ...    ...    ...    ...       ...    ...          ...       ...

shape: (418, 10)

Python Pandas indexing provides KeyError: (slice(None, None, None), )

I am indexing and slicing my data using Pandas in Python3 to calculate spatial statistics.
When I run a for loop over the range of latitudes and longitudes using .loc, it raises KeyError: (slice(None, None, None), ...) for any combination of latitude and longitude for which no values are available in the input file. Instead of skipping those combinations, it raises the error and stops the code. Following is my code:
import numpy as np
import pandas as pd
from scipy import stats

filename = 'input.txt'
df = pd.read_csv(filename, delim_whitespace=True, header=None,
                 names=['year', 'month', 'lat', 'lon', 'aod'],
                 index_col=['year', 'month', 'lat', 'lon'])
idx = pd.IndexSlice
for i in range(1, 13):
    for lat0 in np.arange(0., 40.25, 0.25, dtype=float):
        for lon0 in np.arange(20.0, 75.25, 0.25, dtype=float):
            tmp = df.loc[idx[:, i, lat0, lon0], :]
            if len(tmp) <= 0:
                continue
            tmp2 = tmp.index.tolist()
In the code above, if I run for tmp = df.loc[idx[:,1,0.0,34.0],:], it works well and provides the following output, which I used for the further calculation.
aod
year month lat lon
2003 1 0.0 34.0 0.032000
2006 1 0.0 34.0 0.114000
2007 1 0.0 34.0 0.035000
2008 1 0.0 34.0 0.026000
2011 1 0.0 34.0 0.097000
2012 1 0.0 34.0 0.106333
2013 1 0.0 34.0 0.081000
2014 1 0.0 34.0 0.038000
2015 1 0.0 34.0 0.278500
2016 1 0.0 34.0 0.033000
2017 1 0.0 34.0 0.036333
2019 1 0.0 34.0 0.064333
2020 1 0.0 34.0 0.109500
But when I run the same code for tmp = df.loc[idx[:,1,0.0,32.75],:], a latitude/longitude combination for which no values are available in the input file, instead of skipping it, it gives me the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 925, in __getitem__
return self._getitem_tuple(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1100, in _getitem_tuple
return self._getitem_lowerdim(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 822, in _getitem_lowerdim
return self._getitem_nested_tuple(tup)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 906, in _getitem_nested_tuple
obj = getattr(obj, self.name)._getitem_axis(key, axis=axis)
File "/usr/lib/python3/dist-packages/pandas/core/indexing.py", line 1157, in _getitem_axis
locs = labels.get_locs(key)
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3347, in get_locs
indexer = _update_indexer(
File "/usr/lib/python3/dist-packages/pandas/core/indexes/multi.py", line 3296, in _update_indexer
raise KeyError(key)
KeyError: (slice(None, None, None), 1, 0.0, 32.75)
I tried to replace .loc with .iloc, but that produced a "too many indexers" error. I also tried solutions from the internet using .to_numpy(), .values and .as_matrix(), but nothing worked.
The idiomatic Pandas solution would be to write this as a groupby. Example:
# split df into groups by the keys month, lat, and lon
for index, tmp in df.groupby(['month', 'lat', 'lon']):
    # tmp is a dataframe where all rows have identical month, lat, and lon values
    # ... do something with the tmp dataframe ...
This has three benefits.
Speed. A groupby will be faster because it only needs to loop over the dataframe once, rather than searching the whole dataframe for everything matching the first group, then searching for the second group, etc.
Simplicity.
Robustness. From a robustness perspective, if a dataframe doesn't have, for example, any rows matching "month=1,lat=0.0,lon=32.75", then it will not create that group.
More information: User guide on grouping
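Applied to the dataframe from the question, where month, lat and lon are levels of the MultiIndex rather than ordinary columns, a minimal sketch (the statistics step is left as a placeholder):

# group on the index levels; (month, lat, lon) combinations that are absent
# from the data simply never appear as groups, so no KeyError can occur
for (month, lat0, lon0), tmp in df.groupby(level=['month', 'lat', 'lon']):
    years = tmp.index.get_level_values('year').tolist()
    # ... compute spatial statistics on tmp['aod'] here ...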
Remark about groupby aggregation functions
You'll also sometimes see groupby used with aggregation functions. For example, suppose you wanted to get the sum of each column within each group.
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum()
a c
b
1.0 2 3
2.0 2 5
These aggregation functions are faster and easier to use, but sometimes I need something that is custom and unusual, so I'll write a loop. But if you're doing something common, like getting the average of a group, consider looking for an aggregation function.
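For example, taking the mean instead of the sum on the same toy dataframe:
>>> df.groupby(by=["b"]).mean()
       a    c
b
1.0  2.0  3.0
2.0  1.0  2.5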

OneHotEncoder failing after combining dataframes

I have a model that runs successfully.
When I tried to predict using it, it was failing due to the fact that after OneHotEncoding, the test set had more columns than the train.
After some reading I found that I need to concat the two df's first, one-hot encode, then split them apart.
Added a 'temp' column to the train data set with value 'train'.
Added a 'temp' column to the test data set with value 'test'.
This is so that I can split the df apart later using boolean indexing like this:
X = temp_df[temp_df['temp'] == 'train']
X2 = temp_df[temp_df['temp'] == 'test']
Vertically concatenated the two df's.
Verified the shape of the new combined df.
Changed all columns to type 'category' except 'temp', which is object:
basin category
region category
lga category
extraction_type_class category
management category
quality_group category
quantity category
source category
waterpoint_type category
cluster category
temp object
Now I am simply trying to OneHotEncode like I did before. I choose only categorical columns:
cat_ix = temp_df.select_dtypes(include=['category']).columns
And I try to apply with:
ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)], remainder='passthrough')
temp_df = ct.fit_transform(temp_df)
It fails on the temp_df = ct.fit_transform(temp_df) line.
These identical steps worked perfectly before I added the temp column and concat'd the two df's.
The exact error:
Traceback (most recent call last):
File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\compose\_column_transformer.py", line 778, in _hstack
converted_Xs = [
File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\compose\_column_transformer.py", line 779, in <listcomp>
check_array(X, accept_sparse=True, force_all_finite=False)
File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\utils\validation.py", line 738, in check_array
array = np.asarray(array, order=order, dtype=dtype)
File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\pandas\core\generic.py", line 1993, in __array__
return np.asarray(self._values, dtype=dtype)
ValueError: could not convert string to float: 'train'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\Mark\AppData\Local\Programs\Python\Python38\lib\site-packages\sklearn\compose\_column_transformer.py", line 783, in _hstack
raise ValueError(
ValueError: For a sparse output, all columns should be a numeric or convertible to a numeric.
Why is it complaining about 'train'? That is in the 'temp' column which is being excluded.
Note that the traceback doesn't reference OneHotEncoder, it's all the ColumnTransformer. You're trying to pass through the temp column, which gets tacked onto the one-hot-encoded sparse matrix in the method _hstack, and the second error message is the more relevant one. It cannot stack a string-type array onto a numeric sparse array (which leads to the first error message).
If the sparse matrix isn't too large, you can just force it to be dense by using sparse_threshold=0 in the ColumnTransformer or sparse=False in the OneHotEncoder. If it is too large for memory (or you'd prefer the sparse matrices), you could use a 0/1 indicator for the train/test split instead of the strings "train", "test".
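A minimal sketch of the first option (forcing dense output), assuming cat_ix and temp_df are as defined in the question:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# sparse_threshold=0 forces a dense result, so the passthrough string
# column can be stacked next to the one-hot-encoded columns
ct = ColumnTransformer([('o', OneHotEncoder(), cat_ix)],
                       remainder='passthrough',
                       sparse_threshold=0)
temp_df_encoded = ct.fit_transform(temp_df)

For the second option, encode the split marker numerically before transforming, e.g. temp_df['temp'] = (temp_df['temp'] == 'train').astype(int), and keep the sparse output.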

Unpickling dictionary that holds pandas dataframes throws AttributeError: 'Dataframe' object has no attribute '_data'

I have a class that performs analyses and attaches the results, which are pandas dataframes, as object attributes:
>>> print(test.image.locate_DF)
y x mass ... raw_mass ep frame
0 60.177142 59.788709 33.433414 ... 242.080256 NaN 0
1 60.651991 59.773904 33.724308 ... 242.355784 NaN 1
2 60.790437 60.190234 31.117164 ... 236.276671 NaN 2
3 60.771933 60.048123 33.558372 ... 240.981395 NaN 3
4 60.251282 59.775139 31.881009 ... 239.239022 NaN 4
... ... ... ... ... ... ... ...
7212 68.186380 76.477449 18.122817 ... 176.523091 NaN 9410
7213 68.764444 76.574091 17.486454 ... 173.448306 NaN 9415
7214 68.191152 76.473477 17.402975 ... 172.848119 0.868326 9429
7215 67.034103 76.025885 17.010951 ... 170.928067 -0.600854 9431
7216 68.583276 75.309592 17.852992 ... 178.271558 NaN 9432
Subsequently, I save all the important object attributes in a dictionary, and pickle it for later use:
def save_parameters(self, filepath):
    param_dict = {}
    try:
        self.image.locate_DF
    except AttributeError:
        pass
    else:
        param_dict['optical_locate_DF'] = self.image.locate_DF
    with open(filepath, 'wb') as handle:
        pickle.dump(param_dict, handle, 5)
When trying to load that pickled file, I have no problem at all, the dataframe loads perfectly:
>>> test.save_parameters('test.pickle')
>>> with open('test.pickle', 'rb') as handle:
... result = pickle.load(handle)
...
>>> print(result.keys())
dict_keys(['optical_path', 'optical_feature_diameter', 'optical_feature_minmass', 'optical_locate_DF', 'electrical_path', 'electrical_raw_data', 'electrical_processed_data', 'electrical_mean_voltage'])
>>> print(result['optical_locate_DF'])
y x mass ... raw_mass ep frame
0 60.177142 59.788709 33.433414 ... 242.080256 NaN 0
1 60.651991 59.773904 33.724308 ... 242.355784 NaN 1
2 60.790437 60.190234 31.117164 ... 236.276671 NaN 2
3 60.771933 60.048123 33.558372 ... 240.981395 NaN 3
4 60.251282 59.775139 31.881009 ... 239.239022 NaN 4
... ... ... ... ... ... ... ...
7212 68.186380 76.477449 18.122817 ... 176.523091 NaN 9410
7213 68.764444 76.574091 17.486454 ... 173.448306 NaN 9415
7214 68.191152 76.473477 17.402975 ... 172.848119 0.868326 9429
7215 67.034103 76.025885 17.010951 ... 170.928067 -0.600854 9431
7216 68.583276 75.309592 17.852992 ... 178.271558 NaN 9432
[7217 rows x 9 columns]
However, after running my analysis on a bunch of these files on an HPC cluster and then trying to open that same pickled file (it's named differently now, but it's the same file as shown above, with the same analysis performed on it), pandas throws an AttributeError stating that the DataFrame has no '_data' attribute. The dictionary has the same keys, and the keys that are not a dataframe are printed without any issues:
>>> resultfile = '../results/diam_15_minmass_17_dist_50_mem_5000_tracklength_500/R9_DNA_50mV_001.pickle'
>>> with open(resultfile, 'rb') as handle:
... result = pickle.load(handle)
...
>>> print(result.keys())
dict_keys(['optical_path', 'optical_feature_diameter', 'optical_feature_minmass', 'optical_locate_DF', 'optical_tracking_distance', 'optical_tracking_memory', 'optical_tracking_DF', 'optical_kinetics_DF', 'electrical_path', 'electrical_raw_data', 'electrical_processed_data', 'electrical_mean_voltage'])
>>> print(result['optical_locate_DF'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/stevenvanuytsel/miniconda3/envs/simultaneous_measurements/lib/python3.8/site-packages/pandas/core/frame.py", line 680, in __repr__
self.to_string(
File "/Users/stevenvanuytsel/miniconda3/envs/simultaneous_measurements/lib/python3.8/site-packages/pandas/core/frame.py", line 801, in to_string
formatter = fmt.DataFrameFormatter(
File "/Users/stevenvanuytsel/miniconda3/envs/simultaneous_measurements/lib/python3.8/site-packages/pandas/io/formats/format.py", line 593, in __init__
self.max_rows_displayed = min(max_rows or len(self.frame), len(self.frame))
File "/Users/stevenvanuytsel/miniconda3/envs/simultaneous_measurements/lib/python3.8/site-packages/pandas/core/frame.py", line 1041, in __len__
return len(self.index)
File "/Users/stevenvanuytsel/miniconda3/envs/simultaneous_measurements/lib/python3.8/site-packages/pandas/core/generic.py", line 5270, in __getattr__
return object.__getattribute__(self, name)
File "pandas/_libs/properties.pyx", line 63, in pandas._libs.properties.AxisProperty.__get__
File "/Users/stevenvanuytsel/miniconda3/envs/simultaneous_measurements/lib/python3.8/site-packages/pandas/core/generic.py", line 5270, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute '_data'
I've looked into the pickle manual, and through a bunch of SO questions, but I can't seem to find out what is going wrong here. Does anyone have an idea how to fix this, and also whether I can still access that data?
I had the same problem. I generated a Pandas dataframe in an environment with Pandas 1.1.1 and saved it to a pickle file.
with open('file.pkl', 'wb') as f:
    pickle.dump(data_frame_object, f)
After unpickling it in another session and printing the dataframe I got the same error. Some testing in different environments showed the following pattern:
environment with Pandas >= 1.1.0: works
environment with Pandas == 1.0.5: error message as above
environment with Pandas == 1.0.3: Kernel crashes
I got the same error using the HDF5 format so it seems to be a compatibility issue with the dataframe and different Pandas versions.
Updating Pandas to 1.1.1 in the affected environments solved the issue for me.
After a long and painful process of cross-checking module versions, I found out that this error was caused by a pandas version mismatch. My Mac still ran pandas 1.0.5, whereas the HPC runs pandas 1.1.0. Apparently there is an incompatibility between the two (I'm unsure whether it only affects pickling or also other file formats used to save).
Maybe the problem has already been solved, but I still want to add a comment.
I saved the pkl file on the server, but when I loaded it on my Mac it crashed, showing 'DataFrame' object has no attribute '_data'.
Finally, I found that pandas on my Mac was 1.0.5 but 1.1.5 on the server. When I updated it to the latest version, it just worked.
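A minimal sketch of the check that would have caught this, run in both environments before loading (the file name is just the example from the question):

import pickle
import pandas as pd

# DataFrames pickled under pandas >= 1.1 may fail to unpickle under pandas < 1.1,
# so make sure the loading environment is at least as new as the saving one
print(pd.__version__)

with open('R9_DNA_50mV_001.pickle', 'rb') as handle:
    result = pickle.load(handle)

If the versions differ, upgrading pandas in the older environment (e.g. pip install --upgrade pandas) is what resolved it in the answers above.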

applying a lambda function to pandas dataframe

First time posting on stackoverflow, so bear with me if I'm making some faux pas please :)
I'm trying to calculate the distance between two points, using geopy, but I can't quite get the actual application of the calculation to work.
Here's the head of the dataframe I'm working with (there are some missing values later in the dataframe, not sure if this is the issue or how to handle it in general):
start lat start long end_lat end_long
0 38.902760 -77.038630 38.880300 -76.986200
2 38.895914 -77.026064 38.915400 -77.044600
3 38.888251 -77.049426 38.895914 -77.026064
4 38.892300 -77.043600 38.888251 -77.049426
I've set up a function:
def dist_calc(st_lat, st_long, fin_lat, fin_long):
    from geopy.distance import vincenty
    start = (st_lat, st_long)
    end = (fin_lat, fin_long)
    return vincenty(start, end).miles
This one works fine when given manual input.
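For example, calling it with the coordinates from the first row of the table above:

# start: (38.902760, -77.038630), end: (38.880300, -76.986200)
dist_calc(38.902760, -77.038630, 38.880300, -76.986200)
# roughly 3.22 miles, matching row 0 of the answer's output further down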
However, when I try to apply() the function, I run into trouble with the below code:
distances = df.apply(lambda row: dist_calc(row[-4], row[-3], row[-2], row[-1]), axis=1)
I'm fairly new to python, any help will be much appreciated!
Edit: error message:
distances = df.apply(lambda row: dist_calc2(row[-4], row[-3], row[-2], row[-1]), axis=1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4262, in apply
ignore_failures=ignore_failures)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/frame.py", line 4358, in _apply_standard
results[i] = func(v)
File "<stdin>", line 1, in <lambda>
File "<stdin>", line 5, in dist_calc2
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 322, in __init__
super(vincenty, self).__init__(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 115, in __init__
kilometers += self.measure(a, b)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/geopy/distance.py", line 414, in measure
u_sq = cos_sq_alpha * (major ** 2 - minor ** 2) / minor ** 2
UnboundLocalError: ("local variable 'cos_sq_alpha' referenced before assignment", 'occurred at index 10')
The default settings for pandas functions typically used to import text data like this (pd.read_table() etc) will interpret the spaces in the first 2 column names as separators, so you'll end up with 6 columns instead of 4, and your data will be misaligned:
In [23]: df = pd.read_clipboard()
In [24]: df
Out[24]:
start lat start.1 long end_lat end_long
0 0 38.902760 -77.038630 38.880300 -76.986200 NaN
1 2 38.895914 -77.026064 38.915400 -77.044600 NaN
2 3 38.888251 -77.049426 38.895914 -77.026064 NaN
3 4 38.892300 -77.043600 38.888251 -77.049426 NaN
In [25]: df.columns
Out[25]: Index(['start', 'lat', 'start.1', 'long', 'end_lat', 'end_long'], dtype='object')
Notice column names are wrong, the last column is full of NaNs, etc. If I apply your function to the dataframe in this form, I get the same error as you did.
It's usually better to try to fix this before it gets imported as a dataframe. I can think of 2 methods:
clean the data before importing, for example copy it into an editor and replace the offending spaces with underscores. This is the easiest.
use a regex to fix it during import. This may be necessary if the dataset is very large, or it is pulled from a website and has to be refreshed regularly.
Here's an example of case (2):
In [35]: df = pd.read_clipboard(sep=r'\s{2,}|\s(?=-)', engine='python')
In [36]: df = df.rename_axis({'start lat': 'start_lat', 'start long': 'start_long'}, axis=1)
In [37]: df
Out[37]:
start_lat start_long end_lat end_long
0 38.902760 -77.038630 38.880300 -76.986200
2 38.895914 -77.026064 38.915400 -77.044600
3 38.888251 -77.049426 38.895914 -77.026064
4 38.892300 -77.043600 38.888251 -77.049426
The regex specifies that separators must contain either 2+ whitespace characters, or 1 whitespace followed by a hyphen (minus sign). Then I rename the columns to what I assume are the expected values.
From this point your function / apply works fine, but I've changed it a little:
PEP8 recommends putting imports at the top of each file, rather than in a function
Extracting the columns by name is more robust, and would have given a much more understandable error than the weird error thrown by geopy.
For example:
In [51]: def dist_calc(row):
...: start = row[['start_lat','start_long']]
...: end = row[['end_lat', 'end_long']]
...: return vincenty(start, end).miles
...:
In [52]: df.apply(lambda row: dist_calc(row), axis=1)
Out[52]:
0 3.223232
2 1.674780
3 1.365851
4 0.420305
dtype: float64
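Putting those pieces together as a plain script, a minimal sketch with the import at module level as recommended above (note that vincenty was removed in geopy 2.0, where geodesic is the replacement, so the import assumes an older geopy):

import pandas as pd
from geopy.distance import vincenty  # use geodesic on geopy >= 2.0


def dist_calc(row):
    # pull coordinates by column name, which is more robust than positions
    start = (row['start_lat'], row['start_long'])
    end = (row['end_lat'], row['end_long'])
    return vincenty(start, end).miles


df['distance_miles'] = df.apply(dist_calc, axis=1)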
