intersection of two geopandas GeoSeries gives warning "The indices of the two GeoSeries are different." and few matches - geospatial

I am using geopandas for finding intersections between points and polygons.
When I use the following:
intersection_mb = buffers_df.intersection(rest_VIC)
I get this output with a warning basically saying there are no intersections:
0 None
112780 None
112781 None
112782 None
112784 None
...
201314 None
201323 None
201403 None
201404 None
201444 None
Length: 3960, dtype: geometry
Warning message:
C:\Users\Name\Anaconda3\lib\site-packages\geopandas\base.py:31: UserWarning: The indices of the two GeoSeries are different.
warn("The indices of the two GeoSeries are different.")
I looked for any suggestions and found that I could solve by setting crs for both geoseries for which the intersection is to be performed on, but it did not work.
rest_VIC = rest_VIC.set_crs(epsg=4326, allow_override=True)
buffers_df = buffers_df.set_crs(epsg=4326, allow_override=True)
Any suggestions will be helpful. Thanks.

geopandas.GeoSeries.intersection is an element-wise operation. From the intersection docs:
The operation works on a 1-to-1 row-wise manner
There is an optional align argument which determines whether the series should be first compared based after aligning based on the index (if True, the default) or if the intersection operation should be performed row-wise based on position.
So the warning you're getting, and the resulting NaNs, are because you're performing an elementwise comparison on data with unmatched indices. The same issue would occur in pandas when trying to merge columns for DataFrames with un-aligned indices.
If you're trying to find the mapping from points to polygons across all combinations of rows across the two dataframes, you're looking for a spatial join, which can be done with geopandas.sjoin:
intersection_mb = geopandas.sjoin(
buffers_df,
rest_VIC,
how='outer',
predicate='intersects',
)
See the geopandas guide to merging data for more info.

Related

Similarity of the geometry according to number of nodes and their coordinate values

My question is: If number of nodes in both geometries are same and coordinates are also same (ordering of coordinates might different), can I conclude that both geometries are same?
I have two sets of coordinates (X and Y) of a geometry. I have counted number of nodes available in both geometries and also have coordinates.
I checked whether second geometry coordinates are all available in first cooridnates or not using below line of code and as a result, I got 'Yes'.
new_coordi1 = [(0.26, 0.27), (0.56, 1.22), (1.51, 0.93), (1.21, -0.03), (0.92, -0.98), (1.87, -1.28), (2.17, -0.33), (2.47, 0.63), (-0.04, -0.69), (-0.34, -1.64), (-1.29, -1.34), (-0.99, -0.39), (-0.7, 0.57), (-1.65, 0.86), (-1.95, -0.09), (-2.25, -1.05), (-1.35, 1.82), (-0.4, 1.52)]
new_coordi2 = [(0.26, 0.27), (0.56, 1.22), (1.51, 0.93), (1.21, -0.03), (0.92, -0.98), (1.87, -1.28), (2.17, -0.33), (2.47, 0.63), (-0.04, -0.69), (-0.99, -0.39), (-1.29, -1.34), (-0.34, -1.64), (-2.25, -1.05), (-1.95, -0.09), (-1.65, 0.86), (-0.7, 0.57), (-0.4, 1.52), (-1.35, 1.82)]
check = all(item in new_coordi2 for item in new_coordi1)
Background of this Question: I have two geometries. These geometries are same and they were oriented in the different way in 2D space. I aligned them using Principal Component Analysis and place both geometries at the common orientation. I want to confirm whether two geometries are same or not.
I cannot think of a better description than "the geometries are the same to a reordering of the points".
Note that PCA seems overkill. You could just sort the point lexicographically on the coordinates* and compare point-wise.
*Unless the values are noisy.
I think this approach works.
new_coordd1 = sorted(new_coordi1 , key=lambda tup: (tup[0],tup[1]) )
new_coordd2 = sorted(new_coordi2 , key=lambda tupp: (tupp[0],tupp[1]) )
new_coordd1 == new_coordd2
>>> True

Nested loops altering rows in pandas - Avoiding "A value is trying to be set on a copy of a slice from a DataFrame"

Summary
I am trying to loop through a pandas dataframe, and to run a secondary loop at each iteration. The secondary loop calculates something that I want to append into the original dataframe, so that when the primary loop advances, some of the rows are recalculated based on the changed values. (For those interested, this is a simple advective model of carbon accumulation in soils. When a new layer of soil is deposited, mixing processes penetrate into older layers and transform their properties to a set depth. Thus, each layer deposited changes those below it incrementally, until a former layer lies below the mixing depth.)
I have produced an example of how I want this to work, however it is generating the common error message:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self._setitem_single_block(indexer, value, name)
I have looked into the linked information in the error message as well as myriad posts on this forum, but none get into the continual looping through a changed dataframe.
What I've tried, and some possible solutions
Below is some example code. This code works more or less as well as I want it to. But it produces the warning. Should I:
Suppress the warning and continue working with this architecture? In this case, am I asking for trouble with un-reproducible results?
Try a different architecture altogether, like a numpy array from the original dataframe?
Try df.append() or df.copy() to avoid the warning?
I have tried `df.copy()' to no avail - the warning was still thrown.
Example code:
import pandas as pd
a = pd.DataFrame(
{
'a':[x/2 for x in range(1,11)],
'b':['hot dog', 'slider', 'watermelon', 'funnel cake', 'cotton candy', 'lemonade', 'fried oreo', 'ice cream', 'corn', 'sausage'],
'c':['meat', 'meat', 'vegan', 'vegan', 'vegan', 'vegan', 'dairy','dairy', 'vegan', 'meat']
}
)
print(a)
z = [x/(x+2) for x in range(1,5)]
print(z)
#Primary loop through rows of the main dataframe
for ind, row in a.iterrows():
#Pull out a chunk of the dataframe. This is the portion of the dataframe that will be modified. What is below
#this is already modified and locked into the geological record. What is above has not yet been deposited.
b = a.iloc[ind:(ind+len(z)), :]
#Define the size of the secondary loop. Taking the minimum avoids the model mixing below the boundary layer (key error)
loop = min([len(z), len(b)])
#Now loop through the sub-dataframe and change accordingly.
for fraction in range(loop):
b['a'].iloc[fraction] = b['a'].iloc[fraction]*z[fraction]
#Append the original dataframe with new data:
a.iloc[ind:(ind+loop), :] = b
#Try df.copy(), but still throws warning!
#a.iloc[ind:(ind+loop), :] = b.copy()
print(a)

How to preprocess a dataset with many types of missing data

I'm trying to do the beginner machine learning project Big Mart Sales.
The data set of this project contains many types of missing values (NaN), and values that need to be changed (lf -> Low Fat, reg -> Regular, etc.)
My current approach to preprocess this data is to create an imputer for every type of data needs to be fixed:
from sklearn.impute import SimpleImputer as Imputer
# make the values consistent
lf_imputer = Imputer(missing_values='LF', strategy='constant', fill_value='Low Fat')
lowfat_imputer = Imputer(missing_values='low fat', strategy='constant', fill_value='Low Fat')
X[:,1:2] = lf_imputer.fit_transform(X[:,1:2])
X[:,1:2] = lowfat_imputer.fit_transform(X[:,1:2])
# nan for a categorical variable
nan_imputer = Imputer(missing_values=np.nan, strategy='most_frequent')
X[:, 7:8] = nan_imputer.fit_transform(X[:, 7:8])
# nan for a numerical variable
nan_num_imputer = Imputer(missing_values=np.nan, strategy='mean')
X[:, 0:1] = nan_num_imputer.fit_transform(X[:, 0:1])
However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?
In addition, it is frustrating that imputer.fit_transform() requires a 2D array as an input whereas I only want to fix the values in a single column (1D). Thus, I always have to use the column that I want to fix plus a column next to it as inputs. Is there any other way to get around this? Thanks.
Here are some rows of my data:
There is a python package which can do this for you in a simple way, ctrl4ai
pip install ctrl4ai
from ctrl4ai import preprocessing
preprocessing.impute_nulls(dataset)
Usage: [arg1]:[pandas dataframe],[method(default=central_tendency)]:[Choose either central_tendency or KNN]
Description: Auto identifies the type of distribution in the column and imputes null values
Note: KNN consumes more system mermory if the size of the dataset is huge
Returns: Dataframe [with separate column for each categorical values]
However, this approach is pretty cumbersome. Is there any neater way to preprocess this data set?
If you have a numerical column, you can use some approaches to fill the missing data:
A constant value that has meaning within the domain, such as 0, distinct from all other values.
A value from another randomly selected record.
A mean, median or mode value for the column.
A value estimated by another predictive model.
Lets see how it works for a mean for one column e.g.:
One method would be to use fillna from pandas:
X['Name'].fillna(X['Name'].mean(), inplace=True)
For categorical data please have a look here: Impute categorical missing values in scikit-learn

Dask processing of uneven dataframes with fuzzywuzzy

I'm attempting to merge two large dataframes (one 50k+ values, and another with 650k+ values - pared down from 7M+). Merging/matching is being done via fuzzywuzzy, to find which string in the first dataframe matches which string in the other most closely.
At the moment, it takes about 3 minutes to test 100 rows for variables. Consequently, I'm attempting to institute Dask to help with the processing speed. In doing so, Dask returns the following error: 'NotImplementedError: Series getitem in only supported for other series objects with matching partition structure'
Presumably, the error is due to my dataframes not being of equal size. In trying to set a chunksize when converting the my pandas dataframes to dask dataframes, I receive an error (TypeError: 'float' object cannot be interpreted as an integer) even though I previous forced all my datatypes in each dataframe to 'objects'. Consequently, I was forced to use the npartitions parameter in the dataframe conversion, which then leads to the 'NotImplementedError' above.
I've tried to standardize the chunksize with partitions with a mathematical index, and also tried using the npartitions parameter to no effect, and resulting in the same NotImplementedError.
As mentioned my efforts to utilize this without Dask have been successful, but far too slow to be useful.
I've also taken a look at these questions/responses:
- Different error
- No solution presented
- Seems promising, but results are still slow
''''
aprices_filtered_ddf = dd.from_pandas(prices_filtered, chunksize = 25e6) #prices_filtered: 404.2MB
all_data_ddf = dd.from_pandas(all_data, chunksize = 25e6) #all_data: 88.7MB
# import dask
client = Client()
dask.config.set(scheduler='processes')
# Define matching function
def match_name(name, list_names, min_score=0):
# -1 score incase we don't get any matches max_score = -1
# Returning empty name for no match as well
max_name = ""
# Iterating over all names in the other
for name2 in list_names:
#Finding fuzzy match score
score = fuzz.token_set_ratio(name, name2)
# Checking if we are above our threshold and have a better score
if (score > min_score) & (score > max_score):
max_name = name2
max_score = score
return (max_name, max_score)
# List for dicts for easy dataframe creation
dict_list = []
# iterating over our players without salaries found above for name in prices_filtered_ddf['ndc_description'][:100]:
# Use our method to find best match, we can set a threshold here
match = client(match_name(name, all_data_ddf['ndc_description_agg'], 80))
# New dict for storing data
dict_ = {}
dict_.update({'ndc_description_agg' : name})
dict_.update({'ndc_description' : match[0]})
dict_.update({'score' : match[1]})
dict_list.append(dict_)
merge_table = pd.DataFrame(dict_list)
# Display results
merge_table
Here's the full error:
NotImplementedError Traceback (most recent call last)
<ipython-input-219-e8d4dcb63d89> in <module>
3 dict_list = []
4 # iterating over our players without salaries found above
----> 5 for name in prices_filtered_ddf['ndc_description'][:100]:
6 # Use our method to find best match, we can set a threshold here
7 match = client(match_name(name, all_data_ddf['ndc_description_agg'], 80))
C:\Anaconda\lib\site-packages\dask\dataframe\core.py in __getitem__(self, key)
2671 return Series(graph, name, self._meta, self.divisions)
2672 raise NotImplementedError(
-> 2673 "Series getitem in only supported for other series objects "
2674 "with matching partition structure"
2675 )
NotImplementedError: Series getitem in only supported for other series objects with matching partition structure
''''
I expect that the merge_table will return, in a relatively short time, a dataframe with data for each of the update columns. At the moment, it's extremely slow.
I'm afraid there are a number of problems with this question, so after pointing these out, I can only provide some general guidance.
The traceback shown is clearly not produced by the code above
The indentation and syntax are broken
A distributed client is made, then config set not to use it ("processes" is not the distributed scheduler)
The client object is called, client(...), but it is not callable, this shouldn't work at all
The main processing function, match_name is called directly; how do you expect Dask to intervene?
You don't ever call compute(), so in the code given, I'm not sure Dask is invoked at all.
What you actually want to do:
Load your smaller, reference dataframe using pandas, and call client.scatter to make sure all the workers have it
Load your main data with dd.read_csv
Call df.map_partitions(..) to process your data, where the function you pass should take two pandas dataframes, and work row-by-row.

ValueError: shapes (5,14) and (16,) not aligned: 14 (dim 1)!= 16 (dim 0)

I am working on housing dataset and when trying to fit the linear regression model getting error as mentioned. Complete code as below.
I am not sure where is code going wrong. I tried pasting the code as it is from the reference book.
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:\t", lin_reg.predict(some_data_prepared))
ERROR: ValueError: shapes (5,14) and (16,) not aligned: 14 (dim 1) != 16 (dim 0)
What am I doing wrong here?
Explanation
Hi, I guess you are reading and following the Hands on Machine Learning with Scikit Learn and Tensorflow book. The problem also occurred to me.
In the following part of the code you select from the data set the first 5 instances. One of the attributes in the data set which is called ocean_proximity is an object and for the linear regression model to be able to operate with it, it must be translated to an integer, which in the book is done with a one hot encoding.
One hot encoding works by analyzing all the categories that can be assigned to the attribute, in this case 5 ('<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'NEAR BAY', 'ISLAND'), and then creating a matrix of that length for each instance and zeroing every element of the matrix except the category of that instance which is assigned a 1 (or another value). For example:
If ocean_proximity equals '<1H OCEAN' the conversion would be [1, 0, 0, 0, 0]
In this piece of code you select the five first instances of the data set, but this does not assure you that all the categories in "ocean_proximity" will appear. It could happen that only 3 of them appear or just 1. Therefor if you apply a one hot encoding to those five selected rows and only 3 categories appear (for example just 'INLAND', 'ISLAND' and 'NEAR BAY'), the matrices created by the one hot encoding will be of length 3.
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
The error is just telling you that, since the one hot conversion of some_data created matrices of a length inferior to 5, the total columns in some_data_prepared is 14, which is less than the columns in housing_prepared (16), thus making the model unable to predict the prices.
If you transform both some_data_prepared and housing_prepared into dataframes and then call .head() you will see the problem.
some_data_prepared.head()
housing_prepared.head()
Solution
To solve the problem you must create the columns missing in some_data_prepared by creating a zeroed numpy array of shape [5,x] (being 5 the number of rows and x the number of columns missing) and concatenating it to some_data_prepared to match the shape of the housing_prepared data set.
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.fit_transform(some_data)
dummy_array = np.zeros((5,1))
some_data_prepared = np.c_[some_data_prepared, dummy_array]
predictions = linear_regression.predict(some_data_prepared)
print("Predictions: ", predictions)
print("Labels: ", some_labels.values)
Missing category values (ocean proximity in this case) in some_data compared to housing_prepared is the issue.
housing_prepared.shape gives (16512, 16), but some_data_prepared.shape gives (5,14), so add zeros for the missing columns:
dummy_array = np.zeros((5,2))
some_data_prepared = np.c_[some_data_prepared,dummy_array]
the 2 in np.zeros determines the difference of columns
I've at first encountered the same issue on the considered piece of code. After exploring the issues of the handson-ml repository, I think I have understood the subtlety which is causing the error here.
My guess is that (as in my case), closing the notebook might have caused what was in memory (and the trained model in particular) to be lost. In my case, I could get the result and avoid the error rerunning the notebook from the beginning.
Instead, from a theoretical viewpoint, you should never call fit() or fit_transform() on data which is not training data (eg on some_data). Here, running fit_transform(some_data) and then stacking the dummy array to some_data_prepared works, but it forces the model to be trained again on some_data rather than on housing_prepared, which is not what you want.

Resources