This is part of my code; it reads from an Excel file.
I'm getting a type error saying "TypeError: sequence item 0: expected str instance, list found".
text = df.loc[page, ["rev"]]

def remove_punct(text):
    text = ''.join([ch for ch in text if ch not in exclude])
    tokens = re.split('\W+', text),
    tex = " ".join([word for word in tokens if word not in cachedStopWords]),
    return tex

s = df.loc[page, ["rev"]].apply(lambda x: remove_punct(x))
This is the error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-16-4f3c29307e88> in <module>()
26 return tokens
27
---> 28 s=df.loc[page,["rev"]].apply(lambda x:remove_punct(x))
29
30 with open('FileName.csv', 'a', encoding="utf-8") as f:
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
3190 else:
3191 values = self.astype(object).values
-> 3192 mapped = lib.map_infer(values, f, convert=convert_dtype)
3193
3194 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-16-4f3c29307e88> in <lambda>(x)
26 return tokens
27
---> 28 s=df.loc[page,["rev"]].apply(lambda x:remove_punct(x))
29
30 with open('FileName.csv', 'a', encoding="utf-8") as f:
<ipython-input-16-4f3c29307e88> in remove_punct(text)
23 text=''.join([ch for ch in text if ch not in exclude])
24 tokens = re.split('\W+', text),
---> 25 tex = " ".join([ch for ch in tokens if ch not in cachedStopWords]),
26 return tokens
27
TypeError: sequence item 0: expected str instance, list found
I think the commas at the end of these two lines wrap each result in a tuple, which is what you then end up trying to process.
tokens = re.split('\W+', text), # <---- These commas at the end
tex = " ".join([word for word in tokens if word not in cachedStopWords]), # <----
The effect is roughly the same as if you did something like this (edited for a better example):
x = 12 * 24,
y = x * 10,
z = 40
print(f"X = {x}\n"
f"Y = {y}\n"
f"Z = {z}\n")
Output:
X = (288,)
Y = ((288, 288, 288, 288, 288, 288, 288, 288, 288, 288),)
Z = 40
The trailing commas result in tuple packing of your variables.
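For reference, here is a minimal corrected version without the stray commas. Note that exclude and cachedStopWords aren't defined in the post, so the sets below are made-up stand-ins:

```python
import re
import string

# Stand-ins for the question's undefined names (assumptions):
exclude = set(string.punctuation)
cachedStopWords = {"the", "a", "is"}

def remove_punct(text):
    text = ''.join([ch for ch in text if ch not in exclude])
    tokens = re.split(r'\W+', text)   # no trailing comma: tokens is a list of words
    tex = " ".join([word for word in tokens if word not in cachedStopWords])
    return tex

print(remove_punct("the hotel, is great!"))  # hotel great
```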
Goal: Calculate 8 different Intersection over Union (IoU) area numbers, each concerning the intersection of 3 MultiPolygons.
There are 3 sources, each representing the same 8 groups of shapes.
Mathematically, my instinct is to refer to the Jaccard Index.
Data
I have 3 MultiPolygon lists:
extracted_multipoly
original_multipoly
wkt_multipoly
They each contain e.g.:
[<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e5a8cbb0>,
<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e319fb50>,
<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e303fe20>,
<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e30805e0>,
<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e302d7f0>,
<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e5a2aaf0>,
<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e5a2a160>,
<shapely.geometry.multipolygon.MultiPolygon at 0x7f18e5a2ae20>]
Extracting area:
extracted_multipoly_area = [mp.area for mp in extracted_multipoly]
original_multipoly_area = [mp.area for mp in original_multipoly]
wkt_multipoly_area = [mp.area for mp in wkt_multipoly]
They each contain e.g.:
[17431020.0,
40348778.0,
5453911.5,
5982124.5,
8941145.5,
11854195.5,
10304965.0,
31896495.0]
Procedure Attempts
Using MultiPolygon:
for i, e in enumerate(extracted_multipoly):
    for j, o in enumerate(original_multipoly):
        for k, w in enumerate(wkt_multipoly):
            if e.intersects(o) and e.intersects(w):
                print(i, j, k, (e.intersection(o, w).area/e.area)*100)
[2022-11-18 10:06:40,387][ERROR] TopologyException: side location conflict at 8730 14707. This can occur if the input geometry is invalid.
---------------------------------------------------------------------------
PredicateError Traceback (most recent call last)
File ~/miniconda3/envs/pdl1lung/lib/python3.9/site-packages/shapely/predicates.py:15, in BinaryPredicate.__call__(self, this, other, *args)
14 try:
---> 15 return self.fn(this._geom, other._geom, *args)
16 except PredicateError as err:
17 # Dig deeper into causes of errors.
File ~/miniconda3/envs/pdl1lung/lib/python3.9/site-packages/shapely/geos.py:609, in errcheck_predicate(result, func, argtuple)
608 if result == 2:
--> 609 raise PredicateError("Failed to evaluate %s" % repr(func))
610 return result
PredicateError: Failed to evaluate <_FuncPtr object at 0x7f193af77280>
During handling of the above exception, another exception occurred:
TopologicalError Traceback (most recent call last)
Cell In [38], line 4
2 for j, o in enumerate(original_multipoly):
3 for k, w in enumerate(wkt_multipoly):
----> 4 if e.intersects(o) and e.intersects(w):
5 print(i, j, k, (e.intersection(o, w).area/e.area)*100)
File ~/miniconda3/envs/pdl1lung/lib/python3.9/site-packages/shapely/geometry/base.py:799, in BaseGeometry.intersects(self, other)
797 def intersects(self, other):
798 """Returns True if geometries intersect, else False"""
--> 799 return bool(self.impl['intersects'](self, other))
File ~/miniconda3/envs/pdl1lung/lib/python3.9/site-packages/shapely/predicates.py:18, in BinaryPredicate.__call__(self, this, other, *args)
15 return self.fn(this._geom, other._geom, *args)
16 except PredicateError as err:
17 # Dig deeper into causes of errors.
---> 18 self._check_topology(err, this, other)
File ~/miniconda3/envs/pdl1lung/lib/python3.9/site-packages/shapely/topology.py:37, in Delegating._check_topology(self, err, *geoms)
35 for geom in geoms:
36 if not geom.is_valid:
---> 37 raise TopologicalError(
38 "The operation '%s' could not be performed. "
39 "Likely cause is invalidity of the geometry %s" % (
40 self.fn.__name__, repr(geom)))
41 raise err
TopologicalError: The operation 'GEOSIntersects_r' could not be performed. Likely cause is invalidity of the geometry <shapely.geometry.multipolygon.MultiPolygon object at 0x7f18e5be2f70>
Using area:
for e, o, w in zip(extracted_multipoly_area, original_multipoly_area, wkt_multipoly_area):
    print(e, o, w)
    print(e.intersection(o, w))
22347776.0 22544384.0 17431020.0
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In [31], line 3
1 for e, o, w in zip(extracted_multipoly_area, original_multipoly_area, wkt_multipoly_area):
2 print(e, o, w)
----> 3 print(e.intersection(o, w))
AttributeError: 'float' object has no attribute 'intersection'
Solution
IoU values should be between 0 and 1.
intersection_of_union = []
for e, o in zip(extracted_multipoly, original_multipoly):
    e = e.buffer(0)
    o = o.buffer(0)
    intersection_area = e.intersection(o).area
    intersection_of_union.append(intersection_area / (e.area + o.area - intersection_area))
[0.8970148657684971,
0.9377700784370339,
0.8136220015019057,
0.8980586930524846,
0.8496839666124079,
0.8428598403182237,
0.8599616483904042,
0.9550894396247209]
Adapted from tutorial.
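The pairwise Jaccard computation above extends to three sources as |A∩B∩C| / |A∪B∪C|. Here is a dependency-free sketch with axis-aligned boxes standing in for the MultiPolygons (the boxes are made up; with shapely you would chain e.intersection(o).intersection(w), since intersection takes a single geometry):

```python
def box_area(b):
    # Area of an axis-aligned box (xmin, ymin, xmax, ymax); empty boxes give 0.
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def box_intersection(a, b):
    # Intersection of two axis-aligned boxes (may be empty).
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def iou3(a, b, c):
    inter = box_area(box_intersection(box_intersection(a, b), c))
    # Inclusion-exclusion for |A ∪ B ∪ C|.
    union = (box_area(a) + box_area(b) + box_area(c)
             - box_area(box_intersection(a, b))
             - box_area(box_intersection(a, c))
             - box_area(box_intersection(b, c))
             + inter)
    return inter / union

# Three overlapping boxes (made-up stand-ins for one group of shapes).
a = (0, 0, 2, 2)
b = (1, 0, 3, 2)
c = (0, 0, 2, 2)
print(iou3(a, b, c))  # 0.3333333333333333
```

Applying buffer(0) to each geometry first, as in the solution above, remains necessary with shapely to repair the invalid geometries that caused the TopologyException.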
Reposting because I didn't get a response to the first post.
I have the following data:
desc = pd.DataFrame(description, columns =['new_desc'])
new_desc
257623 the public safety report is compiled from crim...
161135 police say a sea isle city man ordered two pou...
156561 two people are behind bars this morning, after...
41690 pumpkin soup is a beloved breakfast soup in ja...
70092 right now, 15 states are grappling with how be...
... ...
207258 operation legend results in 59 more arrests, i...
222170 see story, 3a
204064 st. louis — missouri secretary of state jason ...
151443 tony lavell jones, 54, of sunset view terrace,...
97367 walgreens, on the other hand, is still going t...
[9863 rows x 1 columns]
I'm trying to find the dominant topic within the documents. When I run the following code
best_lda_model = lda_desc
data_vectorized = tfidf
lda_output = best_lda_model.transform(data_vectorized)
topicnames = ["Topic " + str(i) for i in range(best_lda_model.n_components)]
docnames = ["Doc " + str(i) for i in range(len(dataset))]
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns = topicnames, index = docnames)
dominant_topic = np.argmax(df_document_topic.values, axis = 1)
df_document_topic['dominant_topic'] = dominant_topic
I've tried tweaking the code; however, no matter what I change, I get the following traceback:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
c:\python36\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_blocks(blocks, axes)
1673
-> 1674 mgr = BlockManager(blocks, axes)
1675 mgr._consolidate_inplace()
c:\python36\lib\site-packages\pandas\core\internals\managers.py in __init__(self, blocks, axes, do_integrity_check)
148 if do_integrity_check:
--> 149 self._verify_integrity()
150
c:\python36\lib\site-packages\pandas\core\internals\managers.py in _verify_integrity(self)
328 if block.shape[1:] != mgr_shape[1:]:
--> 329 raise construction_error(tot_items, block.shape[1:], self.axes)
330 if len(self.items) != tot_items:
ValueError: Shape of passed values is (9863, 8), indices imply (0, 8)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<ipython-input-41-bd470d69b181> in <module>
4 topicnames = ["Topic " + str(i) for i in range(best_lda_model.n_components)]
5 docnames = ["Doc " + str(i) for i in range(len(dataset))]
----> 6 df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns = topicnames, index = docnames)
7 dominant_topic = np.argmax(df_document_topic.values, axis = 1)
8 df_document_topic['dominant_topic'] = dominant_topic
c:\python36\lib\site-packages\pandas\core\frame.py in __init__(self, data, index, columns, dtype, copy)
495 mgr = init_dict({data.name: data}, index, columns, dtype=dtype)
496 else:
--> 497 mgr = init_ndarray(data, index, columns, dtype=dtype, copy=copy)
498
499 # For data is list-like, or Iterable (will consume into list)
c:\python36\lib\site-packages\pandas\core\internals\construction.py in init_ndarray(values, index, columns, dtype, copy)
232 block_values = [values]
233
--> 234 return create_block_manager_from_blocks(block_values, [columns, index])
235
236
c:\python36\lib\site-packages\pandas\core\internals\managers.py in create_block_manager_from_blocks(blocks, axes)
1679 blocks = [getattr(b, "values", b) for b in blocks]
1680 tot_items = sum(b.shape[0] for b in blocks)
-> 1681 raise construction_error(tot_items, blocks[0].shape[1:], axes, e)
1682
1683
ValueError: Shape of passed values is (9863, 8), indices imply (0, 8)
The desired result is to produce a list of documents belonging to a specific topic. Below is example code and the desired output.
df_document_topic(df_document_topic['dominant_topic'] == 2).head(10)
When I run this code, I get the following traceback
TypeError Traceback (most recent call last)
<ipython-input-55-8cf9694464e6> in <module>
----> 1 df_document_topic(df_document_topic['dominant_topic'] == 2).head(10)
TypeError: 'DataFrame' object is not callable
Below is the desired output
Any help would be greatly appreciated.
The index you're passing as docnames is empty; it is obtained from dataset as follows:
docnames = ["Doc " + str(i) for i in range(len(dataset))]
This means that dataset is empty too. As a workaround, you can create the Doc indices based on the size of lda_output instead:
docnames = ["Doc " + str(i) for i in range(len(lda_output))]
Let me know if this works.
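To see why building docnames from lda_output works, here is a dependency-free sketch (the lda_output values are made up, and np.argmax(..., axis=1) is mimicked with a plain max over each row):

```python
# Hypothetical stand-in for lda_output: 3 documents x 2 topics.
lda_output = [[0.2, 0.8],
              [0.9, 0.1],
              [0.4, 0.6]]

# Build doc labels from lda_output itself rather than the (empty) dataset,
# so the index length always matches the number of rows.
docnames = ["Doc " + str(i) for i in range(len(lda_output))]

# Dominant topic per document: index of the largest weight in each row
# (what np.argmax(lda_output, axis=1) computes).
dominant_topic = [max(range(len(row)), key=row.__getitem__) for row in lda_output]

print(docnames)        # ['Doc 0', 'Doc 1', 'Doc 2']
print(dominant_topic)  # [1, 0, 1]
```

Once df_document_topic exists, note that filtering by topic uses square brackets rather than a call, e.g. df_document_topic[df_document_topic['dominant_topic'] == 2].head(10); the parentheses in the post are what triggered "'DataFrame' object is not callable".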
This is the error I am getting. In the previous post I forgot to include both functions. The first function reads a CSV file, removes punctuation, and sends the string to the second function to calculate the sentiment score. This code produces output for a few rows of the CSV file and then shows this error. I'm new to Python.
Traceback (most recent call last):
File "C:/Users/Public/Downloads/Hotelsurvey.py", line 116, in <module>
Countswordofeachsyntax()
File "C:/Users/Public/Downloads/Hotelsurvey.py", line 92, in Countswordofeachsyntax
print(findsentimentalscore(nopunct))
File "C:/Users/Public/Downloads/Hotelsurvey.py", line 111, in findsentimentalscore
ss =ss + weight
TypeError: unsupported operand type(s) for +: 'int' and 'list'
def Countswordofeachsyntax():
    nopunct = ""
    with open('dataset-CalheirosMoroRita-2017.csv', 'r') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter='|')
        for sys in csv_reader:
            for value in sys:
                nopunct = ""
                for ch in value:
                    if ch not in punctuation:
                        nopunct = nopunct + ch
                print(findsentimentalscore(nopunct))

def findsentimentalscore(st):
    ss = 0
    count = len(st.split())
    mycollapsedstring = ' '.join(st.split())
    print(str(mycollapsedstring.split(' ')) + " := " + str(len(mycollapsedstring.split())))
    for key, weight in keywords.items():
        if key in mycollapsedstring.lower():
            ss = ss + weight
            #print(key, weight)
    res = (ss / count * 100)
    return math.ceil(res)
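The traceback says ss + weight adds an int to a list, which suggests that at least some values in keywords are lists rather than numbers. If that is the case, one fix is to normalize list weights to a single number before adding. A sketch with a made-up keywords dict (the real one isn't shown in the post):

```python
import math

# Hypothetical keywords dict where some weights are lists, which is what
# the "unsupported operand type(s) for +: 'int' and 'list'" error suggests.
keywords = {"good": 2, "bad": [-2, -1]}

def findsentimentalscore(st):
    ss = 0
    count = len(st.split())
    collapsed = ' '.join(st.split())
    for key, weight in keywords.items():
        if key in collapsed.lower():
            # Normalize list weights to a single number before adding.
            if isinstance(weight, list):
                weight = sum(weight)
            ss = ss + weight
    return math.ceil(ss / count * 100)

print(findsentimentalscore("good and bad service"))  # -25
```

Whether sum(weight), weight[0], or an average is the right normalization depends on what the lists in your keywords dict actually mean.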
I am working on a dataset of tweets, and I am trying to find the mentions of other users in each tweet; a tweet can mention none, one, or multiple users.
Here is the head of the DataFrame:
The following is the function that I created to extract the list of mentions in a tweet:
def getMention(text):
    mention = re.findall('(^|[^#\w])#(\w{1,15})', text)
    if len(mention) > 0:
        return [x[1] for x in mention]
    else:
        return None
I'm trying to create a new column in the DataFrame and apply the function with the following code:
df['mention'] = df['text'].apply(getMention)
On running this code I get the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-43-426da09a8770> in <module>
----> 1 df['mention'] = df['text'].apply(getMention)
~/anaconda3_501/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
3192 else:
3193 values = self.astype(object).values
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
3195
3196 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-42-d27373022afd> in getMention(text)
1 def getMention(text):
2
----> 3 mention = re.findall('(^|[^#\w])#(\w{1,15})', text)
4 if len(mention) > 0:
5 return [x[1] for x in mention]
~/anaconda3_501/lib/python3.6/re.py in findall(pattern, string, flags)
220
221 Empty matches are included in the result."""
--> 222 return _compile(pattern, flags).findall(string)
223
224 def finditer(pattern, string, flags=0):
TypeError: expected string or bytes-like object
I can't comment (not enough rep) so here's what I suggest to troubleshoot the error.
It seems findall raises an exception because text is not a string, so you might want to check which type text actually is, using this:
def getMention(text):
    print(type(text))
    mention = re.findall(r'(^|[^#\w])#(\w{1,15})', text)
    if len(mention) > 0:
        return [x[1] for x in mention]
    else:
        return None
(or use the debugger, if you know how)
And if text can be converted to a string, maybe try this:
def getMention(text):
    mention = re.findall(r'(^|[^#\w])#(\w{1,15})', str(text))
    if len(mention) > 0:
        return [x[1] for x in mention]
    else:
        return None
P.S.: don't forget the r'...' prefix on your regexp, to keep special characters from being interpreted as escape sequences.
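A quick stdlib-only sanity check of the second version (the sample strings are made up). The str() coercion is what guards against non-string values such as the NaN floats pandas uses for missing text:

```python
import re

def getMention(text):
    # str() guards against non-string values such as NaN floats.
    mention = re.findall(r'(^|[^#\w])#(\w{1,15})', str(text))
    if len(mention) > 0:
        return [x[1] for x in mention]
    else:
        return None

print(getMention("shoutout to #alice and #bob"))  # ['alice', 'bob']
print(getMention(float("nan")))                   # None
```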
I have been all over this site and google trying to solve this problem.
It appears as though I'm missing a fundamental concept in making a plottable dataframe.
I've tried to ensure that I have a column of strings for the "Teams" and a column of ints for the "Points", but I still get: TypeError: Empty 'DataFrame': no numeric data to plot
import csv
import pandas
import numpy
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
set_of_teams = set()

def load_epl_games(file_name):
    with open(file_name, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        raw_data = {"HomeTeam": [], "AwayTeam": [], "FTHG": [], "FTAG": [], "FTR": []}
        for row in reader:
            set_of_teams.add(row["HomeTeam"])
            set_of_teams.add(row["AwayTeam"])
            raw_data["HomeTeam"].append(row["HomeTeam"])
            raw_data["AwayTeam"].append(row["AwayTeam"])
            raw_data["FTHG"].append(row["FTHG"])
            raw_data["FTAG"].append(row["FTAG"])
            raw_data["FTR"].append(row["FTR"])
    data_frame = pandas.DataFrame(data=raw_data)
    return data_frame

def calc_points(team, table):
    points = 0
    for row_number in range(table["HomeTeam"].count()):
        home_team = table.loc[row_number, "HomeTeam"]
        away_team = table.loc[row_number, "AwayTeam"]
        if team in [home_team, away_team]:
            home_team_points = 0
            away_team_points = 0
            winner = table.loc[row_number, "FTR"]
            if winner == 'H':
                home_team_points = 3
            elif winner == 'A':
                away_team_points = 3
            else:
                home_team_points = 1
                away_team_points = 1
            if team == home_team:
                points += home_team_points
            else:
                points += away_team_points
    return points

def get_goals_scored_conceded(team, table):
    scored = 0
    conceded = 0
    for row_number in range(table["HomeTeam"].count()):
        home_team = table.loc[row_number, "HomeTeam"]
        away_team = table.loc[row_number, "AwayTeam"]
        if team in [home_team, away_team]:
            if team == home_team:
                scored += int(table.loc[row_number, "FTHG"])
                conceded += int(table.loc[row_number, "FTAG"])
            else:
                scored += int(table.loc[row_number, "FTAG"])
                conceded += int(table.loc[row_number, "FTHG"])
    return (scored, conceded)

def compute_table(df):
    raw_data = {"Team": [], "Points": [], "GoalDifference": [], "Goals": []}
    for team in set_of_teams:
        goal_data = get_goals_scored_conceded(team, df)
        raw_data["Team"].append(team)
        raw_data["Points"].append(calc_points(team, df))
        raw_data["GoalDifference"].append(goal_data[0] - goal_data[1])
        raw_data["Goals"].append(goal_data[0])
    data_frame = pandas.DataFrame(data=raw_data)
    data_frame = data_frame.sort_values(["Points", "GoalDifference", "Goals"], ascending=[False, False, False]).reset_index(drop=True)
    data_frame.index = numpy.arange(1, len(data_frame) + 1)
    data_frame.index.names = ["Finish"]
    return data_frame

def get_finish(team, table):
    return table[table.Team == team].index.item()

def get_points(team, table):
    return table[table.Team == team].Points.item()

def display_hbar(tables):
    raw_data = {"Team": [], "Points": []}
    for row_number in range(tables["Team"].count()):
        raw_data["Team"].append(tables.loc[row_number + 1, "Team"])
        raw_data["Points"].append(int(tables.loc[row_number + 1, "Points"]))
    df = pandas.DataFrame(data=raw_data)
    #df = pandas.DataFrame(tables, columns=["Team", "Points"])
    print(df)
    print(df.dtypes)
    df["Points"].apply(int)
    print(df.dtypes)
    df.plot(kind='barh', x='Points', y='Team')

games = load_epl_games('epl2016.csv')
final_table = compute_table(games)
#print(final_table)
#print(get_finish("Tottenham", final_table))
#print(get_points("West Ham", final_table))
display_hbar(final_table)
The output:
Team Points
0 Chelsea 93
1 Tottenham 86
2 Man City 78
3 Liverpool 76
4 Arsenal 75
5 Man United 69
6 Everton 61
7 Southampton 46
8 Bournemouth 46
9 West Brom 45
10 West Ham 45
11 Leicester 44
12 Stoke 44
13 Crystal Palace 41
14 Swansea 41
15 Burnley 40
16 Watford 40
17 Hull 34
18 Middlesbrough 28
19 Sunderland 24
Team object
Points int64
dtype: object
Team object
Points int64
dtype: object
Traceback (most recent call last):
File "C:/Users/Michael/Documents/Programming/Python/Premier League.py", line 99, in <module>
display_hbar(final_table)
File "C:/Users/Michael/Documents/Programming/Python/Premier League.py", line 92, in display_hbar
df.plot(kind='barh',x='Points',y='Team')
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 2941, in __call__
sort_columns=sort_columns, **kwds)
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 1977, in plot_frame
**kwds)
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 1804, in _plot
plot_obj.generate()
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 258, in generate
self._compute_plot_data()
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 373, in _compute_plot_data
'plot'.format(numeric_data.__class__.__name__))
TypeError: Empty 'DataFrame': no numeric data to plot
What am I doing wrong in my display_hbar function that is preventing me from plotting my data?
Here is the csv file
df.plot(x = "Team", y="Points", kind="barh");
You should swap x and y in df.plot(...), because y must be numeric according to the pandas documentation.