Shapely Polygons | Calculate Intersection over Union - python-3.x

Goal: Calculate 8 different Intersection over Union (IoU) values, each measuring the overlap of 3 MultiPolygons.
There are 3 sources, each representing the same 8 groups of shapes.
Mathematically, my instinct is to refer to the Jaccard Index.
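For two sets A and B, the Jaccard index is J(A, B) = |A ∩ B| / |A ∪ B|; for polygons, that is the intersection area divided by the union area.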
Data
I have 3 lists of MultiPolygons:
extracted_multipoly
original_multipoly
wkt_multipoly
They each contain e.g.:
[<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e5a8cbb0>,
<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e319fb50>,
<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e303fe20>,
<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e30805e0>,
<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e302d7f0>,
<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e5a2aaf0>,
<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e5a2a160>,
<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e5a2ae20>]
Extracting area:
extracted_multipoly_area = [mp.area for mp in extracted_multipoly]
original_multipoly_area = [mp.area for mp in original_multipoly]
wkt_multipoly_area = [mp.area for mp in wkt_multipoly]
They each contain e.g.:
[17431020.0,
40348778.0,
5453911.5,
5982124.5,
8941145.5,
11854195.5,
10304965.0,
31896495.0]
Procedure Attempts
Using MultiPolygon:
for i, e in enumerate(extracted_multipoly):
    for j, o in enumerate(original_multipoly):
        for k, w in enumerate(wkt_multipoly):
            if e.intersects(o) and e.intersects(w):
                print(i, j, k, (e.intersection(o, w).area/e.area)*100)
[2022-11-18 10:06:40,387][ERROR] TopologyException: side location conflict at 8730 14707. This can occur if the input geometry is invalid.
---------------------------------------------------------------------------
PredicateError Traceback (most recent call last)
File ~/miniconda3/envs/pdl1lung/lib/python3.9/site-packages/shapely/predicates.py:15, in BinaryPredicate.__call__(self, this, other, *args)
14 try:
---> 15 return self.fn(this._geom, other._geom, *args)
16 except PredicateError as err:
17 # Dig deeper into causes of errors.
File ~/miniconda3/envs/pdl1lung/lib/python3.9/site-packages/shapely/geos.py:609, in errcheck_predicate(result, func, argtuple)
608 if result == 2:
--> 609 raise PredicateError("Failed to evaluate %s" % repr(func))
610 return result
PredicateError: Failed to evaluate <_FuncPtr object at 0x7f193af77280>
During handling of the above exception, another exception occurred:
TopologicalError Traceback (most recent call last)
Cell In [38], line 4
2 for j, o in enumerate(original_multipoly):
3 for k, w in enumerate(wkt_multipoly):
----> 4 if e.intersects(o) and e.intersects(w):
5 print(i, j, k, (e.intersection(o, w).area/e.area)*100)
File ~/miniconda3/envs/pdl1lung/lib/python3.9/site-packages/shapely/geometry/base.py:799, in BaseGeometry.intersects(self, other)
797 def intersects(self, other):
798 """Returns True if geometries intersect, else False"""
--> 799 return bool(self.impl['intersects'](self, other))
File ~/miniconda3/envs/pdl1lung/lib/python3.9/site-packages/shapely/predicates.py:18, in BinaryPredicate.__call__(self, this, other, *args)
15 return self.fn(this._geom, other._geom, *args)
16 except PredicateError as err:
17 # Dig deeper into causes of errors.
---> 18 self._check_topology(err, this, other)
File ~/miniconda3/envs/pdl1lung/lib/python3.9/site-packages/shapely/topology.py:37, in Delegating._check_topology(self, err, *geoms)
35 for geom in geoms:
36 if not geom.is_valid:
---> 37 raise TopologicalError(
38 "The operation '%s' could not be performed. "
39 "Likely cause is invalidity of the geometry %s" % (
40 self.fn.__name__, repr(geom)))
41 raise err
TopologicalError: The operation 'GEOSIntersects_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x7f18e5be2f70>
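The TopologyException and TopologicalError indicate invalid input geometry (e.g., self-intersecting rings); the usual Shapely workaround is buffer(0), as used in the solution below, or shapely.validation.make_valid in Shapely >= 1.8.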
Using area:
for e, o, w in zip(extracted_multipoly_area, original_multipoly_area, wkt_multipoly_area):
    print(e, o, w)
    print(e.intersection(o, w))
22347776.0 22544384.0 17431020.0
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [31], line 3
1 for e, o, w in zip(extracted_multipoly_area, original_multipoly_area, wkt_multipoly_area):
2 print(e, o, w)
----> 3 print(e.intersection(o, w))
AttributeError: 'float' object has no attribute 'intersection'

Solution
IoU values should be between 0 and 1.
intersection_of_union = []
for e, o in zip(extracted_multipoly, original_multipoly):
    e = e.buffer(0)
    o = o.buffer(0)
    intersection_area = e.intersection(o).area
    intersection_of_union.append(intersection_area / (e.area + o.area - intersection_area))
[0.8970148657684971,
0.9377700784370339,
0.8136220015019057,
0.8980586930524846,
0.8496839666124079,
0.8428598403182237,
0.8599616483904042,
0.9550894396247209]
Adapted from tutorial.
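Note that Shapely's intersection() is binary (it takes a single other geometry), so e.intersection(o, w) in the attempt above does not compute a three-way intersection. A minimal sketch for the original three-source goal, assuming the three lists are aligned by group index and that buffer(0) is enough to repair the invalid geometries:
three_way_iou = []
for e, o, w in zip(extracted_multipoly, original_multipoly, wkt_multipoly):
    # repair invalid geometries first, as in the solution above
    e, o, w = e.buffer(0), o.buffer(0), w.buffer(0)
    # chain the binary operations to get the three-way intersection and union
    intersection = e.intersection(o).intersection(w)
    union = e.union(o).union(w)
    three_way_iou.append(intersection.area / union.area)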

Related

I'm getting " ValueError: 111816 is not in range" error when trying to use FuzzyWuzzy between two other dataframe column

I am getting an error when trying to use FuzzyWuzzy between two dataframe columns.
I want to match df_1['name_new'] to df['term'].
Below is the site where I got my code:
https://towardsdatascience.com/fuzzy-string-match-with-python-on-large-dataset-and-why-you-should-not-use-fuzzywuzzy-4ec9f0defcd
#Transform text to vectors with TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,2), max_df=0.9, min_df=5, token_pattern='(\S+)')
tf_idf_matrix_1 = tfidf_vectorizer.fit_transform(df_1['name_new'])
tf_idf_matrix_2 = tfidf_vectorizer.fit_transform(df['term'])
I created tf_idf_matrix_2 to match the other df's 'term' column.
from scipy.sparse import csr_matrix
!pip install sparse_dot_topn
import sparse_dot_topn.sparse_dot_topn as ct
def awesome_cossim_top(A, B, ntop, lower_bound=0):
# force A and B as a CSR matrix.
# If they have already been CSR, there is no overhead
A = A.tocsr()
B = B.tocsr()
M, _ = A.shape
_, N = B.shape
idx_dtype = np.int32
nnz_max = M*ntop
indptr = np.zeros(M+1, dtype=idx_dtype)
indices = np.zeros(nnz_max, dtype=idx_dtype)
data = np.zeros(nnz_max, dtype=A.dtype)
ct.sparse_dot_topn(
M, N, np.asarray(A.indptr, dtype=idx_dtype),
np.asarray(A.indices, dtype=idx_dtype),
A.data,
np.asarray(B.indptr, dtype=idx_dtype),
np.asarray(B.indices, dtype=idx_dtype),
B.data,
ntop,
lower_bound,
indptr, indices, data)
return csr_matrix((data,indices,indptr),shape=(M,N))
import time
t1 = time.time()
# adjust lower bound: 0.8
# keep top 10 similar results
matches = awesome_cossim_top(tf_idf_matrix_1, tf_idf_matrix_2.transpose(), 10, 0.8)
t = time.time()-t1
print("finished in:", t)
def get_matches_df(sparse_matrix, name_vector, top=10000):
    non_zeros = sparse_matrix.nonzero()
    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]
    if top:
        nr_matches = top
    else:
        nr_matches = sparsecols.size
    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similairity = np.zeros(nr_matches)
    for index in range(0, nr_matches):
        left_side[index] = name_vector[sparserows[index]]
        right_side[index] = name_vector[sparsecols[index]]
        similairity[index] = sparse_matrix.data[index]
    return pd.DataFrame({'name_new_1': left_side,
                         'term_1': right_side,
                         'similairity_score': similairity})
matches_df = pd.DataFrame()
matches_df = get_matches_df(matches, df_1['name_new'], top=10000)
# Remove all exact matches
The error I get is this:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
384 try:
--> 385 return self._range.index(new_key)
386 except ValueError as err:
ValueError: 111816 is not in range
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
4 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
385 return self._range.index(new_key)
386 except ValueError as err:
--> 387 raise KeyError(key) from err
388 raise KeyError(key)
389 return super().get_loc(key, method=method, tolerance=tolerance)
KeyError: 111816
Please help... what is wrong with my code?
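A likely cause, judging from the traceback: get_matches_df looks up both the row and the column indices in the same name_vector, but the columns of matches refer to df['term'], which has a different length (and index) than df_1['name_new'], so a column index like 111816 falls outside the shorter Series. A hedged sketch that passes both Series, uses positional .iloc lookups, and caps top at the number of stored matches:
import numpy as np
import pandas as pd

def get_matches_df(sparse_matrix, left_vector, right_vector, top=10000):
    # left_vector labels the rows (df_1['name_new']),
    # right_vector labels the columns (df['term'])
    sparserows, sparsecols = sparse_matrix.nonzero()
    nr_matches = min(top, sparsecols.size) if top else sparsecols.size
    left_side = np.empty([nr_matches], dtype=object)
    right_side = np.empty([nr_matches], dtype=object)
    similarity = np.zeros(nr_matches)
    for index in range(nr_matches):
        left_side[index] = left_vector.iloc[sparserows[index]]    # positional, not label-based
        right_side[index] = right_vector.iloc[sparsecols[index]]
        similarity[index] = sparse_matrix.data[index]
    return pd.DataFrame({'name_new_1': left_side,
                         'term_1': right_side,
                         'similarity_score': similarity})

matches_df = get_matches_df(matches, df_1['name_new'], df['term'], top=10000)
Separately, fit_transform is called twice, so the two matrices live in different vocabularies; using tf_idf_matrix_2 = tfidf_vectorizer.transform(df['term']) keeps both in the same vector space.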

ValueError: Shape of passed values is, indices imply

Reposting because I didn't get a response to the first post.
I have the following data:
desc = pd.DataFrame(description, columns =['new_desc'])
new_desc
257623 the public safety report is compiled from crim...
161135 police say a sea isle city man ordered two pou...
156561 two people are behind bars this morning, after...
41690 pumpkin soup is a beloved breakfast soup in ja...
70092 right now, 15 states are grappling with how be...
... ...
207258 operation legend results in 59 more arrests, i...
222170 see story, 3a
204064 st. louis — missouri secretary of state jason ...
151443 tony lavell jones, 54, of sunset view terrace,...
97367 walgreens, on the other hand, is still going t...
[9863 rows x 1 columns]
I'm trying to find the dominant topic within the documents, and when I run the following code:
best_lda_model = lda_desc
data_vectorized = tfidf
lda_output = best_lda_model.transform(data_vectorized)
topicnames = ["Topic " + str(i) for i in range(best_lda_model.n_components)]
docnames = ["Doc " + str(i) for i in range(len(dataset))]
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns = topicnames, index = docnames)
dominant_topic = np.argmax(df_document_topic.values, axis = 1)
df_document_topic['dominant_topic'] = dominant_topic
I've tried tweaking the code; however, no matter what I change, I get the following traceback:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
c:\python36\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_blocks(blocks, axes)
1673
-> 1674 mgr = BlockManager(blocks, axes)
1675 mgr._consolidate_inplace()
c:\python36\lib\site-packages\pandas\core\internals\managers.py in __init__(self, blocks, axes, do_integrity_check)
148 if do_integrity_check:
--> 149 self._verify_integrity()
150
c:\python36\lib\site-packages\pandas\core\internals\managers.py in _verify_integrity(self)
328 if block.shape[1:] != mgr_shape[1:]:
--> 329 raise construction_error(tot_items, block.shape[1:], self.axes)
330 if len(self.items) != tot_items:
ValueError: Shape of passed values is (9863, 8), indices imply (0, 8)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-41-bd470d69b181> in <module>
4 topicnames = ["Topic " + str(i) for i in range(best_lda_model.n_components)]
5 docnames = ["Doc " + str(i) for i in range(len(dataset))]
----> 6 df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns = topicnames, index = docnames)
7 dominant_topic = np.argmax(df_document_topic.values, axis = 1)
8 df_document_topic['dominant_topic'] = dominant_topic
c:\python36\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
495 mgr = init_dict({data.name: data}, index, columns, dtype=dtype)
496 else:
--> 497 mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
498
499 # For data is list-like, or Iterable (will consume into list)
c:\python36\lib\site-packages\pandas\core\internals\construction.py in init_ndarray(values, index, columns, dtype, copy)
232 block_values = [values]
233
--> 234 return create_block_manager_from_blocks(block_values, [columns, index])
235
236
c:\python36\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_blocks(blocks, axes)
1679 blocks = [getattr(b, "values", b) for b in blocks]
1680 tot_items = sum(b.shape[0] for b in blocks)
-> 1681 raise construction_error(tot_items, blocks[0].shape[1:], axes, e)
1682
1683
ValueError: Shape of passed values is (9863, 8), indices imply (0, 8)
The desired result is to produce a list of documents according to a specific topic. Below is example code and the desired output.
df_document_topic(df_document_topic['dominant_topic'] == 2).head(10)
When I run this code, I get the following traceback
TypeError Traceback (most recent call last)
<ipython-input-55-8cf9694464e6> in <module>
----> 1 df_document_topic(df_document_topic['dominant_topic'] == 2).head(10)
TypeError: 'DataFrame' object is not callable
Below is the desired output
Any help would be greatly appreciated.
The index you're passing as docnames is empty; it is obtained from dataset as follows:
docnames = ["Doc " + str(i) for i in range(len(dataset))]
So this means that the dataset is empty too. For a workaround, you can create Doc indices based on the size of lda_output as follows:
docnames = ["Doc " + str(i) for i in range(len(lda_output))]
Let me know if this works.
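On the later TypeError: a DataFrame is filtered with square brackets, not called like a function, so the selection should read:
df_document_topic[df_document_topic['dominant_topic'] == 2].head(10)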

Dataframe Pandas aggregation and/or groupby

I have a dataframe like this:
serie = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
values = [2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2]
series_X_values = {'series': serie, 'values': values}
df_mytest = pd.DataFrame.from_dict(series_X_values)
df_mytest
I need to create a third column (for example, most_frequent)
df_mytest['most_frequent'] = np.nan
whose values will be the most frequent value observed in the 'values' column grouped by 'series'; or replace the values in the 'values' column with the most frequent term itself, as in the dataframe below:
serie = [1, 2, 3]
values = [2, 2, 1]
series_X_values = {'series': serie, 'values': values}
df_mytest = pd.DataFrame.from_dict(series_X_values)
df_mytest
I tried some unsuccessful options like:
def personal_most_frequent(col_name):
    from sklearn.impute import SimpleImputer
    imp = SimpleImputer(strategy="most_frequent")
    return imp

df_result = df_mytest.groupby('series').apply(personal_most_frequent('values'))
but...
TypeError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
688 try:
--> 689 result = self._python_apply_general(f)
690 except Exception:
5 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
706 keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707 self.axis)
708
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
189 group_axes = _get_axes(group)
--> 190 res = f(group)
191 if not _is_indexed_like(res, group_axes):
TypeError: 'SimpleImputer' object is not callable
During handling of the above exception, another exception occurred:
TypeError Traceback (most recent call last)
in ()
5 return imp
6
----> 7 df_result = df_mytest.groupby('series').apply(personal_most_frequent('values'))
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
699
700 with _group_selection_context(self):
--> 701 return self._python_apply_general(f)
702
703 return result
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
705 def _python_apply_general(self, f):
706 keys, values, mutated = self.grouper.apply(f, self._selected_obj,
--> 707 self.axis)
708
709 return self._wrap_applied_output(
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
188 # group might be modified
189 group_axes = _get_axes(group)
--> 190 res = f(group)
191 if not _is_indexed_like(res, group_axes):
192 mutated = True
TypeError: 'SimpleImputer' object is not callable
and...
df_mytest.groupby(['series', 'values']).agg(lambda x:x.value_counts().index[0])
but again...
IndexError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/groupby/ops.py in agg_series(self, obj, func)
589 try:
--> 590 return self._aggregate_series_fast(obj, func)
591 except Exception:
12 frames
pandas/_libs/reduction.pyx in pandas._libs.reduction.SeriesGrouper.get_result()
pandas/_libs/reduction.pyx in pandas._libs.reduction.SeriesGrouper.get_result()
IndexError: index 0 is out of bounds for axis 0 with size 0
During handling of the above exception, another exception occurred:
IndexError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in __getitem__(self, key)
3956 if is_scalar(key):
3957 key = com.cast_scalar_indexer(key)
--> 3958 return getitem(key)
3959
3960 if isinstance(key, slice):
IndexError: index 0 is out of bounds for axis 0 with size 0
I ask for help from the community to complete this process.
Assuming you are OK with tie-breaking equally represented values by taking the max value, you could do something like:
df_mf = df_mytest.groupby('series')['values'].apply(lambda ds: ds.mode().max()).to_frame('most_frequent')
df_mytest.merge(df_mf, 'left', left_on='series', right_index=True)
Out:
series values most_frequent
0 1 2 2
1 1 2 2
2 1 2 2
3 1 1 2
4 2 2 2
5 2 2 2
6 2 1 2
7 2 1 2
8 3 1 1
9 3 1 1
10 3 1 1
11 3 2 1
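An equivalent one-liner with transform, assuming the same tie-breaking by max, which broadcasts each group's mode back onto the original rows:
df_mytest['most_frequent'] = (
    df_mytest.groupby('series')['values']
             .transform(lambda s: s.mode().max())
)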

How to impute values in a column and overwrite existing values

I'm trying to learn machine learning, and I need to fill in the missing values for the cleaning stage of the workflow. I have 13 columns and need to impute the values for 8 of them. One column is called Dependents, and I want to fill in the blanks with the word missing and change the cells that do contain data as follows: 1 to one, 2 to two, 3 to three and 3+ to threePlus.
I'm running the program in Anaconda, and the name of the dataframe is train.
train.columns
this gives me
Index(['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education',
'Self_Employed', 'ApplicantIncome', 'CoapplicantIncome', 'LoanAmount',
'Loan_Amount_Term', 'Credit_History', 'Property_Area', 'Loan_Status'],
dtype='object')
next
print("Dependents")
print(train['Dependents'].unique())
this gives me
Dependents
['0' '1' '2' '3+' nan]
Now I try imputing values as stated:
def impute_dependent():
    my_dict = {'1':'one','2':'two','3':'three','3+':'threePlus'}
    return train.Dependents.map(my_dict).fillna('missing')

def convert_data(dataset):
    temp_data = dataset.copy()
    temp_data['Dependents'] = temp_data[['Dependents']].apply(impute_dependent, axis=1)
    return temp_data
this gives the error
TypeError Traceback (most recent call last)
<ipython-input-46-ccb1a5ea7edd> in <module>()
4 return temp_data
5
----> 6 train_dataset = convert_data(train)
7 #test_dataset = convert_data(test)
<ipython-input-46-ccb1a5ea7edd> in convert_data(dataset)
1 def convert_data(dataset):
2 temp_data = dataset.copy()
----> 3 temp_data['Dependents'] =
temp_data[['Dependents']].apply(impute_dependent,axis=1)
4 return temp_data
5
D:\Anaconda2\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
6002 args=args,
6003 kwds=kwds)
-> 6004 return op.get_result()
6005
6006 def applymap(self, func):
D:\Anaconda2\lib\site-packages\pandas\core\apply.py in get_result(self)
140 return self.apply_raw()
141
--> 142 return self.apply_standard()
143
144 def apply_empty_result(self):
D:\Anaconda2\lib\site-packages\pandas\core\apply.py in apply_standard(self)
246
247 # compute the result using the series generator
--> 248 self.apply_series_generator()
249
250 # wrap results
D:\Anaconda2\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
275 try:
276 for i, v in enumerate(series_gen):
--> 277 results[i] = self.f(v)
278 keys.append(v.name)
279 except Exception as e:
TypeError: ('impute_dependent() takes 0 positional arguments but 1 was given', 'occurred at index 0')
I expected one, two, three and threePlus to replace the existing values, and missing to fill in the blanks.
Would this do?
my_dict = {'1': 'one', '2': 'two', '3': 'three', '3+': 'threePlus', np.nan: 'missing'}

def convert_data(dataset):
    temp_data = dataset.copy()
    temp_data.Dependents = temp_data.Dependents.map(my_dict)
    return temp_data
As a side note, part of your problem might be the use of apply: essentially, apply passes data through a function and puts in what comes out. I might be wrong, but I think your function needs to take the input given by apply, e.g.:
def impute_dependent(dep):
    my_dict = {'1': 'one', '2': 'two', '3': 'three', '3+': 'threePlus', np.nan: 'missing'}
    return my_dict[dep]

df.dependents = df.dependents.apply(impute_dependent)
This way, for every value in df.dependents, apply will take that value and give it to impute_dependent as an argument, then take the returned value as output. As is, when I try your code I get an error because impute_dependent takes no arguments.
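A compact sketch combining both suggestions, assuming the column is string-typed with NaN for blanks. Note that neither mapping handles '0', which appears in the unique values above, so '0' would also become missing here unless an entry such as '0': 'zero' is added:
my_dict = {'1': 'one', '2': 'two', '3': 'three', '3+': 'threePlus'}
train['Dependents'] = train['Dependents'].map(my_dict).fillna('missing')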

Getting Type Error Expected Strings or Bytes Like Object

I am working on a dataset of tweets and I am trying to find the mentions of other users in each tweet; a tweet can mention none, one, or multiple users.
Here is the head of the DataFrame:
The following is the function that I created to extract the list of mentions in a tweet:
def getMention(text):
    mention = re.findall('(^|[^#\w])#(\w{1,15})', text)
    if len(mention) > 0:
        return [x[1] for x in mention]
    else:
        return None
I'm trying to create a new column in the DataFrame and apply the function with the following code:
df['mention'] = df['text'].apply(getMention)
On running this code I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-43-426da09a8770> in <module>
----> 1 df['mention'] = df['text'].apply(getMention)
~/anaconda3_501/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
3192 else:
3193 values = self.astype(object).values
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
3195
3196 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-42-d27373022afd> in getMention(text)
1 def getMention(text):
2
----> 3 mention = re.findall('(^|[^#\w])#(\w{1,15})', text)
4 if len(mention) > 0:
5 return [x[1] for x in mention]
~/anaconda3_501/lib/python3.6/re.py in findall(pattern, string, flags)
220
221 Empty matches are included in the result."""
--> 222 return _compile(pattern, flags).findall(string)
223
224 def finditer(pattern, string, flags=0):
TypeError: expected string or bytes-like object
I can't comment (not enough rep) so here's what I suggest to troubleshoot the error.
It seems findall raises an exception because text is not a string, so you might want to check which type text actually is, using this:
def getMention(text):
    print(type(text))
    mention = re.findall(r'(^|[^#\w])#(\w{1,15})', text)
    if len(mention) > 0:
        return [x[1] for x in mention]
    else:
        return None
(or the debugger if you know how to)
And if text can be converted to a string, maybe try this?
def getMention(text):
    mention = re.findall(r'(^|[^#\w])#(\w{1,15})', str(text))
    if len(mention) > 0:
        return [x[1] for x in mention]
    else:
        return None
P.S.: don't forget the r'...' prefix on your regexp, so backslash escapes aren't interpreted by Python before reaching the regex engine.
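If the culprit turns out to be NaN values in df['text'] (they are floats, the usual cause of "expected string or bytes-like object"), coercing the column to str before applying also avoids the error:
df['mention'] = df['text'].astype(str).apply(getMention)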
