Drop similar text rows of one column in Python


import pandas as pd
from difflib import SequenceMatcher
df = pd.DataFrame({"id":[9,12,13,14],
"text":["Error number 609 at line 10", "Error number 609 at line 22", "Error string 'foo' at line 11", "Error string 'bar' at line 14"]})
Output:
   id                           text
0   9    Error number 609 at line 10
1  12    Error number 609 at line 22
2  13  Error string 'foo' at line 11
3  14  Error string 'bar' at line 14
I want to use difflib.SequenceMatcher to find rows whose text has a similarity score above 80, and keep only one row from each group of similar rows.
a = "Error number 609 at line 10"
b = "Error number 609 at line 22"
c = "Error string 'foo' at line 11"
d = "Error string 'bar' at line 14"
print(SequenceMatcher(None, a, b).ratio()*100) #92.5925925925926
print(SequenceMatcher(None, b, c).ratio()*100) #60.71428571428571
print(SequenceMatcher(None, c, d).ratio()*100) #86.20689655172413
print(SequenceMatcher(None, a, c).ratio()*100) #64.28571428571429
How can I get the expected result shown below in Python? You can use difflib or other Python packages. Thank you.
   id                           text
0   9    Error number 609 at line 10
2  13  Error string 'foo' at line 11

You can use a cross join to compare every pair of texts:
import numpy as np
#cross join on a helper column, keeping only the text column on the right side
df = df.assign(a=1).merge(df[['text']].assign(a=1), on='a')
#filter out rows where both texts are identical
df = df[df['text_x'] != df['text_y']]
#sort the two text columns within each row
df[['text_x','text_y']] = pd.DataFrame(np.sort(df[['text_x','text_y']],axis=1), index=df.index)
#remove duplicated pairs
df = df.drop_duplicates(subset=['text_x','text_y'])
#get similarity
df['r'] = df.apply(lambda x: SequenceMatcher(None, x.text_x, x.text_y).ratio(), axis=1)
#keep only pairs with similarity above 0.8
df = df[df['r'] > 0.8].drop(['a','r'], axis=1)
print (df)
    id                         text_x                         text_y
1    9    Error number 609 at line 10    Error number 609 at line 22
11  13  Error string 'bar' at line 14  Error string 'foo' at line 11
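The pairwise table shows which texts are similar, but it is not yet the two-row result asked for in the question. A minimal sketch of one way to get there, assuming that "keep only one" means keeping the first row of each group of similar texts: walk through the rows in order and keep a row only if it is not similar to any row already kept. The helper name drop_similar_rows and its parameters are hypothetical and not part of pandas or difflib.

```python
from difflib import SequenceMatcher

import pandas as pd


def drop_similar_rows(frame, column="text", threshold=0.8):
    """Hypothetical helper: keep a row only if its text is not too
    similar to the text of any row that was already kept."""
    kept_labels = []
    kept_texts = []
    for label, text in frame[column].items():
        is_similar = any(
            SequenceMatcher(None, kept, text).ratio() > threshold
            for kept in kept_texts
        )
        if not is_similar:
            kept_labels.append(label)
            kept_texts.append(text)
    return frame.loc[kept_labels]


df = pd.DataFrame({"id": [9, 12, 13, 14],
                   "text": ["Error number 609 at line 10",
                            "Error number 609 at line 22",
                            "Error string 'foo' at line 11",
                            "Error string 'bar' at line 14"]})

result = drop_similar_rows(df)
print(result)  # rows with id 9 and 13 remain
```

This greedy pass depends on row order and compares each row against every kept row, so it is only a reasonable fit for small frames.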

Related

To find the location at which an error has occurred

I need to validate that the values in a column fall within a given range. If a value is greater or less than the allowed range, an error should be reported together with the row number or index where it occurred.
My data is as follows:
Draft_Fore
12
14
87
16
90
It should report an error for the values 87 and 90, since I require the values in this column to be greater than 5 and less than 20.
The code I have tried is as follows:
def validate_rating(Draft_Fore):
    Draft_Fore = int(Draft_Fore)
    if Draft_Fore > 5 and Draft_Fore <= 20:
        return True
    return False
df = pd.read_csv("/home/anu/Desktop/dr.csv")
for i, Draft_Fore in enumerate(df):
    try:
        validate_rating(Draft_Fore)
    except Exception as e:
        print('Error at index {}: {!r}'.format(i, Draft_Fore))
        print(e)
I want to print the location of the row where the error occurred.

A little explanation to clarify my comment. Assuming your dataframe looks like
df = pd.DataFrame({'col1': [12, 14, 87, 16, 90]})
you could do
def check_in_range(v, lower_lim, upper_lim):
    if lower_lim < v <= upper_lim:
        return True
    return False
lower_lim, upper_lim = 5, 20
for i, v in enumerate(df['col1']):
    if not check_in_range(v, lower_lim, upper_lim):
        print(f"value {v} at index {i} is out of range!")
# --> gives you
value 87 at index 2 is out of range!
value 90 at index 4 is out of range!
So your check function is basically fine. However, if you call enumerate on a DataFrame, the values you iterate over are the column names. What you need is to enumerate the specific column.
Concerning your idea to raise an exception, I'd suggest having a look at raise and assert.
So you could, e.g., use raise:
for i, v in enumerate(df['col1']):
    if not check_in_range(v, lower_lim, upper_lim):
        raise ValueError(f"value {v} at index {i} is out of range")
# --> gives you
ValueError: value 87 at index 2 is out of range
or assert:
for i, v in enumerate(df['col1']):
    assert v > lower_lim and v <= upper_lim, f"value {v} at index {i} is out of range"
# --> gives you
AssertionError: value 87 at index 2 is out of range
Note: If you have a df, why not use its features for convenience? To get the in-range values of the column, you could just do
df[(df['col1'] > lower_lim) & (df['col1'] <= upper_lim)]
# --> gives you
   col1
0    12
1    14
3    16
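The same boolean-mask idea can be turned around to list the offending rows directly, which is what the question asks for. A small sketch, assuming the same sample frame and limits as above; the names out_of_range and bad_rows are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': [12, 14, 87, 16, 90]})
lower_lim, upper_lim = 5, 20

# True where the value lies outside the interval (lower_lim, upper_lim]
out_of_range = (df['col1'] <= lower_lim) | (df['col1'] > upper_lim)

# Selecting with the mask keeps the original index labels of the bad rows
bad_rows = df[out_of_range]
for idx, value in bad_rows['col1'].items():
    print(f"value {value} at index {idx} is out of range!")
```

Because the mask keeps the original index, this reports index labels rather than enumerate positions; the two coincide only for a default RangeIndex.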

Using nunique to tag the duplicate values in a dataframe but getting an error

I am trying to tag duplicated values with a comment, but I am getting TypeError: string indices must be integers.
Input:
Key
ab
bc
df
ab
Expected output:
Key | Comment
ab  | Check it
bc  |
df  |
ab  | Check it
My code:
condition_2= lambda x: "Check it" if x["Key"].nunique()>=1 else 0
df["Comments"]=semi_final_df.Key.apply(condition_2)
Error:
TypeError Traceback (most recent call last)
<ipython-input-175-dc8d1ac8148f> in <module>
----> 1 semi_final_df["Comments"]=semi_final_df.Key.apply(condition_2)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
3192 else:
3193 values = self.astype(object).values
-> 3194 mapped = lib.map_infer(values, f, convert=convert_dtype)
3195
3196 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src\inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-174-cf54900ff760> in <lambda>(x)
----> 1 condition_2= lambda x: " Check it" if x["Key"].nunique()>=1 else 0
TypeError: string indices must be integers

Use Series.duplicated with keep=False to build a mask that marks every occurrence of a duplicated value, and pass it to numpy.where:
df["Comments"]= np.where(df.Key.duplicated(keep=False), "Check it", '')
print (df)
  Key  Comments
0  ab  Check it
1  bc
2  df
3  ab  Check it
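As for why the original attempt fails: Series.apply calls the function once per element, so inside the lambda x is a plain string such as "ab", and x["Key"] tries to index that string with another string. An alternative sketch that counts occurrences explicitly, assuming the same sample data; mapping each key to its value_counts entry is a swapped-in technique here, not the method from the answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Key": ["ab", "bc", "df", "ab"]})

# How often each key occurs, looked up per row
counts = df["Key"].map(df["Key"].value_counts())

# Tag every row whose key occurs more than once
df["Comments"] = np.where(counts > 1, "Check it", "")
print(df)
```

Keeping the counts around is handy if the comment should later depend on how many times a key is repeated.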

Cannot plot dataframe as barh because TypeError: Empty 'DataFrame': no numeric data to plot

I have been all over this site and Google trying to solve this problem.
I seem to be missing a fundamental concept in making a plottable dataframe.
I've tried to ensure that I have a column of strings for the "Teams" and a column of ints for the "Points".
Still I get: TypeError: Empty 'DataFrame': no numeric data to plot
import csv
import pandas
import numpy
import matplotlib.pyplot as plt
from matplotlib.ticker import StrMethodFormatter
set_of_teams = set()
def load_epl_games(file_name):
    with open(file_name, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        raw_data = {"HomeTeam": [], "AwayTeam": [], "FTHG": [], "FTAG": [], "FTR": []}
        for row in reader:
            set_of_teams.add(row["HomeTeam"])
            set_of_teams.add(row["AwayTeam"])
            raw_data["HomeTeam"].append(row["HomeTeam"])
            raw_data["AwayTeam"].append(row["AwayTeam"])
            raw_data["FTHG"].append(row["FTHG"])
            raw_data["FTAG"].append(row["FTAG"])
            raw_data["FTR"].append(row["FTR"])
    data_frame = pandas.DataFrame(data=raw_data)
    return data_frame
def calc_points(team, table):
    points = 0
    for row_number in range(table["HomeTeam"].count()):
        home_team = table.loc[row_number, "HomeTeam"]
        away_team = table.loc[row_number, "AwayTeam"]
        if team in [home_team, away_team]:
            home_team_points = 0
            away_team_points = 0
            winner = table.loc[row_number, "FTR"]
            if winner == 'H':
                home_team_points = 3
            elif winner == 'A':
                away_team_points = 3
            else:
                home_team_points = 1
                away_team_points = 1
            if team == home_team:
                points += home_team_points
            else:
                points += away_team_points
    return points
def get_goals_scored_conceded(team, table):
    scored = 0
    conceded = 0
    for row_number in range(table["HomeTeam"].count()):
        home_team = table.loc[row_number, "HomeTeam"]
        away_team = table.loc[row_number, "AwayTeam"]
        if team in [home_team, away_team]:
            if team == home_team:
                scored += int(table.loc[row_number, "FTHG"])
                conceded += int(table.loc[row_number, "FTAG"])
            else:
                scored += int(table.loc[row_number, "FTAG"])
                conceded += int(table.loc[row_number, "FTHG"])
    return (scored, conceded)
def compute_table(df):
    raw_data = {"Team": [], "Points": [], "GoalDifference":[], "Goals": []}
    for team in set_of_teams:
        goal_data = get_goals_scored_conceded(team, df)
        raw_data["Team"].append(team)
        raw_data["Points"].append(calc_points(team, df))
        raw_data["GoalDifference"].append(goal_data[0] - goal_data[1])
        raw_data["Goals"].append(goal_data[0])
    data_frame = pandas.DataFrame(data=raw_data)
    data_frame = data_frame.sort_values(["Points", "GoalDifference", "Goals"], ascending=[False, False, False]).reset_index(drop=True)
    data_frame.index = numpy.arange(1,len(data_frame)+1)
    data_frame.index.names = ["Finish"]
    return data_frame
def get_finish(team, table):
    return table[table.Team==team].index.item()
def get_points(team, table):
    return table[table.Team==team].Points.item()
def display_hbar(tables):
    raw_data = {"Team": [], "Points": []}
    for row_number in range(tables["Team"].count()):
        raw_data["Team"].append(tables.loc[row_number+1, "Team"])
        raw_data["Points"].append(int(tables.loc[row_number+1, "Points"]))
    df = pandas.DataFrame(data=raw_data)
    #df = pandas.DataFrame(tables, columns=["Team", "Points"])
    print(df)
    print(df.dtypes)
    df["Points"].apply(int)
    print(df.dtypes)
    df.plot(kind='barh',x='Points',y='Team')
games = load_epl_games('epl2016.csv')
final_table = compute_table(games)
#print(final_table)
#print(get_finish("Tottenham", final_table))
#print(get_points("West Ham", final_table))
display_hbar(final_table)
The output:
Team Points
0 Chelsea 93
1 Tottenham 86
2 Man City 78
3 Liverpool 76
4 Arsenal 75
5 Man United 69
6 Everton 61
7 Southampton 46
8 Bournemouth 46
9 West Brom 45
10 West Ham 45
11 Leicester 44
12 Stoke 44
13 Crystal Palace 41
14 Swansea 41
15 Burnley 40
16 Watford 40
17 Hull 34
18 Middlesbrough 28
19 Sunderland 24
Team object
Points int64
dtype: object
Team object
Points int64
dtype: object
Traceback (most recent call last):
File "C:/Users/Michael/Documents/Programming/Python/Premier League.py", line 99, in <module>
display_hbar(final_table)
File "C:/Users/Michael/Documents/Programming/Python/Premier League.py", line 92, in display_hbar
df.plot(kind='barh',x='Points',y='Team')
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 2941, in __call__
sort_columns=sort_columns, **kwds)
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 1977, in plot_frame
**kwds)
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 1804, in _plot
plot_obj.generate()
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 258, in generate
self._compute_plot_data()
File "C:\Program Files (x86)\Python36-32\lib\site-packages\pandas\plotting\_core.py", line 373, in _compute_plot_data
'plot'.format(numeric_data.__class__.__name__))
TypeError: Empty 'DataFrame': no numeric data to plot
What am I doing wrong in my display_hbar function that is preventing me from plotting my data?
Here is the csv file

You should swap x and y in df.plot(...), because y must be numeric according to the pandas documentation:

df.plot(x="Team", y="Points", kind="barh")
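A self-contained sketch of the corrected call, using the first three rows of the table printed in the question. The Agg backend and the invert_yaxis call are assumptions added for illustration: the former lets the example run without a display, the latter puts the first row at the top of the chart.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed so this runs without a display
import pandas as pd

df = pd.DataFrame({"Team": ["Chelsea", "Tottenham", "Man City"],
                   "Points": [93, 86, 78]})

# x holds the category labels, y holds the numeric bar lengths
ax = df.plot(x="Team", y="Points", kind="barh")

# barh draws the first row at the bottom; flip the axis so it appears on top
ax.invert_yaxis()
```

In an interactive session, drop the matplotlib.use line and call matplotlib.pyplot.show() to display the figure.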

What is the meaning of 'NK' in pandas int64?

I have a column pathsize (int64). However, some of its values are given as 'NK'. I've tried to convert these values into integers, but it doesn't seem to have any effect.
NK 687
15 180
12 172
14 166
...
3 123
Name: pathsize, Length: 92, dtype: int64
The script I used to convert NK into 0:
def pathsize(row):
    if (row["pathsize"] != 'NK'):
        return row["pathsize"]
    return 0
df['pathsize'] = df.apply(pathsize, axis=1)
The script runs without errors, but when I try to process the data (converting it to float), I get the following error:
ValueError: could not convert string to float: ' NK'
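No answer is recorded for this question here, so the following is only a hedged sketch. The error message shows ' NK' with a leading space, which would explain why the comparison against 'NK' never matches. Assuming the column actually holds strings, and that mapping 'NK' to 0 is acceptable as the question's own function intends, pd.to_numeric with errors="coerce" turns every unparseable entry into NaN, which can then be filled. The sample values below are made up for illustration.

```python
import pandas as pd

# Made-up sample: string values, with 'NK' entries carrying a leading space
df = pd.DataFrame({"pathsize": [" NK", "15", "12", " NK", "3"]})

# Strip whitespace, then convert; anything non-numeric becomes NaN
cleaned = pd.to_numeric(df["pathsize"].str.strip(), errors="coerce")

# Replace the NaN placeholders with 0 and store the result as integers
df["pathsize"] = cleaned.fillna(0).astype(int)
print(df["pathsize"].tolist())
```

If 0 is a meaningful path size in the data, keeping the NaN values instead of filling them avoids mixing "not known" with a real measurement.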

TypeError: unhashable type: 'Int64Index'

The section of my code that is causing me problems is
def Half_Increase(self):
    self.keg_count=summer17.iloc[self.result_rows,2].values[0]
    self.keg_count +=1
    summer17[self.result_rows,2] = self.keg_count
    print(keg_count)
So this function is to be executed when a button widget is pressed. It's supposed to get the value from a specific cell in a dataframe, add 1 to it, and then return the new value to the dataframe. (I'm not entirely sure if this is the proper way to do this.)
I get the following error
Exception in Tkinter callback
Traceback (most recent call last):
File "C:\Python3.6\lib\tkinter\__init__.py", line 1699, in __call__
return self.func(*args)
File "beerfest_program_v0.3.py", line 152, in Half_Increase
summer17[self.result_rows,2] = self.keg_count
File "C:\Python3.6\lib\site-packages\pandas\core\frame.py", line 2331, in __setitem__
self._set_item(key, value)
File "C:\Python3.6\lib\site-packages\pandas\core\frame.py", line 2397, in _set_item
value = self._sanitize_column(key, value)
File "C:\Python3.6\lib\site-packages\pandas\core\frame.py", line 2596, in _sanitize_column
if broadcast and key in self.columns and value.ndim == 1:
File "C:\Python3.6\lib\site-packages\pandas\core\indexes\base.py", line 1640, in __contains__
hash(key)
File "C:\Python3.6\lib\site-packages\pandas\core\indexes\base.py", line 1667, in __hash__
raise TypeError("unhashable type: %r" % type(self).__name__)
TypeError: unhashable type: 'Int64Index'
I'm guessing this has something to do with the variable types not matching, but I've looked and can't find how to remedy this.

I think you need iloc for the assignment as well:
summer17.iloc[result_rows,2] += 1
Sample:
summer17 = pd.DataFrame({'a':[1,2,3],
                         'b':[3,4,5],
                         'c':[5,9,7]})
#if result_rows is scalar
result_rows = 1
print(summer17)
   a  b  c
0  1  3  5
1  2  4  9
2  3  5  7
summer17.iloc[result_rows,2] += 1
print(summer17)
   a  b   c
0  1  3   5
1  2  4  10
2  3  5   7
It is the same as:
#get value
keg_count=summer17.iloc[result_rows,2]
#increment
keg_count +=1
#set value
summer17.iloc[result_rows,2] = keg_count
print(summer17)
   a  b   c
0  1  3   5
1  2  4  10
2  3  5   7
But if result_rows is a list or 1d array:
result_rows = [1,2]
#get all values at the positions defined in result_rows
#keep only the first value by values[0]
keg_count=summer17.iloc[result_rows,2].values[0]
#increment
keg_count +=1
#set all rows in result_rows to the incremented value
summer17.iloc[result_rows,2] = keg_count
print(summer17)
   a  b   c
0  1  3   5
1  2  4  10
2  3  5  10
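Note that the list variant above writes the same incremented first value into every selected row. If the intent is instead to add 1 to each selected cell independently, a sketch along these lines may be closer to it, assuming the same sample frame and that result_rows holds integer positions:

```python
import pandas as pd

summer17 = pd.DataFrame({'a': [1, 2, 3],
                         'b': [3, 4, 5],
                         'c': [5, 9, 7]})
result_rows = [1, 2]

# Read the selected cells, add 1 to each, and write them back by position
summer17.iloc[result_rows, 2] += 1
print(summer17)
```

Here rows 1 and 2 of column c go from 9 and 7 to 10 and 8, while row 0 is left untouched.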
