pandas: calculate fuzzywuzzy for each category separately - python-3.x

I have a dataset as follows, only with more rows:
import pandas as pd

data = {'First': ['First value', 'Third value', 'Second value', 'First value', 'Third value', 'Second value'],
        'Second': ['the old man is here', 'the young girl is there', 'the old woman is here',
                   'the young boy is there', 'the young girl is here', 'the old girl is here']}
df = pd.DataFrame(data, columns=['First', 'Second'])
I have calculated the fuzzywuzzy average for the entire dataset like this:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from itertools import combinations

def similarity_measure(doc1, doc2):
    return fuzz.token_set_ratio(doc1, doc2)

d = df.groupby('First')['Second'].apply(lambda x: ', '.join(x))
d = d.reset_index()

scores = []
for val in list(combinations(range(len(d)), 2)):
    scores.append(similarity_measure(d.iloc[val[0], 1], d.iloc[val[1], 1]))

avg = sum(scores) / len(scores)
print('lexical overlap between all example pairs in the dataset is: ', avg)
However, I would also like to get this average for each category in the first column separately.
So I would like something like this (for example):
similarity average for sentences in First value: 85.56
similarity average for sentences in Second value: 89.01
similarity average for sentences in Third value: 90.01
So I would like to modify the for loop so that it produces the above output.

To compute the mean within each group, you need two steps:
1. Group by some criterion, in your case the column First. It seems you already know how to do this.
2. Create a function that computes the similarity for a group (the all_similarity_measure function in the code below).
Code
import pandas as pd
from fuzzywuzzy import fuzz
from itertools import combinations

def similarity_measure(doc1, doc2):
    return fuzz.token_set_ratio(doc1, doc2)

data = {'First': ['First value', 'Third value', 'Second value', 'First value', 'Third value', 'Second value'],
        'Second': ['the old man is here', 'the young girl is there', 'the old woman is here',
                   'the young boy is there', 'the young girl is here', 'the old girl is here']}
df = pd.DataFrame(data, columns=['First', 'Second'])

def all_similarity_measure(gdf):
    """This function computes the similarity between all pairs of sentences in a Series"""
    return pd.Series([similarity_measure(*docs) for docs in combinations(gdf, 2)]).mean()

res = df.groupby('First', as_index=False)['Second'].apply(all_similarity_measure)
print(res)
Output
First Second
0 First value 63.0
1 Second value 86.0
2 Third value 98.0
The key to compute the mean similarity is this expression:
return pd.Series([similarity_measure(*docs) for docs in combinations(gdf, 2)]).mean()
basically you generate the pairs of sentences using combinations (no need to access by index), construct a Series and compute mean on it.
Any function for computing the mean can be used instead of the above; for example, you could use statistics.mean to avoid constructing a Series.
from statistics import mean

def all_similarity_measure(gdf):
    """This function computes the similarity between all pairs of sentences in a Series"""
    return mean(similarity_measure(*docs) for docs in combinations(gdf, 2))
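For completeness, a minimal usage sketch (not part of the original answer) that reuses similarity_measure and combinations from above and prints one line per category, in the format the question asks for:

for category, group in df.groupby('First')['Second']:
    # All pairwise similarities among the sentences of this category.
    scores = [similarity_measure(a, b) for a, b in combinations(group, 2)]
    avg = sum(scores) / len(scores) if scores else float('nan')
    print(f'similarity average for sentences in {category}: {avg:.2f}')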

Related

How to find the average amount of time somebody won a race by Python Pandas

So, as seen in the dataframe, there are 3 races. I want to find the time difference between first and second place for each race; the output would then be the average margin by which each runner won their races.
import pandas as pd

# initialise data of lists.
data = {'Name': ['A', 'B', 'B', 'C', 'A', 'C'],
        'RaceNumber': [1, 1, 2, 2, 3, 3],
        'PlaceWon': ['First', 'Second', 'First', 'Second', 'First', 'Second'],
        'TimeRanInSec': [100, 98, 66, 60, 75, 70]}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output.
print(df)
In this case, the output would be a data frame showing that A won races by an average of 3.5 sec and B won by an average of 6 sec.
I imagine this could be done by grouping by RaceNumber and then subtracting TimeRanInSec, but I'm unsure how to get the average for each Name.
I think you need two groupby operations, one to get the winning margin for each race, and then one to get the average winning margin for each person.
For a general solution, I would first define a function that calculates the winning margin from a list of times (for one race). Then you can apply that function to the times in each race group and join the resulting winning margins to the dataframe of all the winners. Then it's easy to get the desired averages:
def winning_margin(times):
    # Winning margin = runner-up time minus the winning (minimum) time.
    times = list(times)
    winner = min(times)
    times.remove(winner)
    return min(times) - winner

# Winning margin for each race.
winning_margins = df[['RaceNumber', 'TimeRanInSec']] \
    .groupby('RaceNumber').agg(winning_margin)
winning_margins.columns = ['margin']

# Attach each race's margin to its winner, then average per winner.
winners = df.loc[df.PlaceWon == 'First', :]
winners = winners.join(winning_margins, on='RaceNumber')
avg_margins = winners[['Name', 'margin']].groupby('Name').mean()
avg_margins
margin
Name
A 3.5
B 6.0
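A more compact alternative (a sketch, not from the original answer) gets each race's margin by sorting the times and taking the gap between the two fastest, assuming every race has at least two finishers:

# Sketch: per-race winning margin via sorted times, then averaged per winner.
# Assumes the df defined in the question and at least two finishers per race.
margins = (df.sort_values(['RaceNumber', 'TimeRanInSec'])
             .groupby('RaceNumber')['TimeRanInSec']
             .apply(lambda t: t.iloc[1] - t.iloc[0])
             .rename('margin'))

winners = df.loc[df.PlaceWon == 'First', ['Name', 'RaceNumber']]
avg_margins = winners.join(margins, on='RaceNumber').groupby('Name')['margin'].mean()
print(avg_margins)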

Creating a function using dictionary to change column from strings to integers

I am a complete newbie to Spark. I have an RDD with a column containing the strings {'Fair', 'Good', 'Better', 'Best'} and I want to create a function that will change those to {1, 2, 3, 4} using a dictionary. This is what I have so far, but it is not working: it comes back with 'str' object has no attribute 'items'. I am using an RDD, not a Pandas data frame. I need the function to be usable as a UDF to change the original data frame, so it would be followed by
spark.udf.register( , ).
Examples of data:
Name    Rank    Price
Red     Best    25.00
Blue    Fair     5.00
Yellow  Good     8.00
Green   Better  20.00
Black   Good    12.00
White   Fair     7.00
def rank(n):
    b = {"Fair": 1, "Good": 2, "Better": 3, "Best": 4}
    rep = {v: k for k, v in b.items()}
    return rep

spark.udf.register('RANK', rank)

df.select(
    '*',
    expr('RANK(Rank)')).show(5)
This works:
def rank(n):
    if n == "Fair":
        return 1
    elif n == "Good":
        return 2
    elif n == "Better":
        return 3
    elif n == "Best":
        return 4
    else:
        return n

spark.udf.register('RANK', rank)
But I want a simpler formula.
from pyspark.sql.functions import col, create_map, lit
from itertools import chain
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])
df.withColumn("new_column", mapping_expr.getItem(col("old_column")))
where mapping is your dict (don't call it list, as that name is already used by the built-in list class in Python).
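Put together for this question, a minimal sketch (assuming an active SparkSession and the df and Rank column shown above) could look like:

from pyspark.sql.functions import col, create_map, lit
from itertools import chain

# Assumed names: `df` is the DataFrame from the question, `Rank` its string column,
# `RankNum` a new illustrative column name.
mapping = {"Fair": 1, "Good": 2, "Better": 3, "Best": 4}

# Build a map literal from the dict: keys and values interleaved as column literals.
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])

df.withColumn("RankNum", mapping_expr.getItem(col("Rank"))).show()

If a registered UDF is still required, the same dictionary can simply be looked up inside the function with dict.get instead of an if/elif chain.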

Pandas dataframe column names seem wrong

I'm a student and therefore a rookie. I'm trying to create a Pandas dataframe of crime statistics by neighborhood in San Francisco. My problem is that I want the column names to be simply "Neighborhood" and "Count". Instead I seem to be stuck with a separate line that says "('Neighborhood', 'count')" instead of the proper labels. Here's the code:
df_counts = df_incidents.copy()
df_counts.rename(columns={'PdDistrict':'Neighborhood'}, inplace=True)
df_counts.drop(['IncidntNum', 'Category', 'Descript', 'DayOfWeek', 'Date', 'Time', 'Location', 'Resolution', 'Address', 'X', 'Y', 'PdId'], axis=1, inplace=True)
df_totals=df_counts.groupby(['Neighborhood']).agg({'Neighborhood':['count']})
df_totals.columns = list(map(str, df_totals.columns)) # Not sure if I need this
df_totals
Output:
('Neighborhood', 'count')
Neighborhood
BAYVIEW 14303
CENTRAL 17666
INGLESIDE 11594
MISSION 19503
NORTHERN 20100
PARK 8699
RICHMOND 8922
SOUTHERN 28445
TARAVAL 11325
TENDERLOIN 9942
No need for agg() here, you can simply do:
df_totals = df_counts.groupby(['Neighborhood']).count()
df_totals.columns = ['count']
df_totals = df_totals.reset_index()  # turn the Neighborhood index back into a regular column
And if you want to print the output without the numerical index:
print(df_totals.to_string(index=False))
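Note that if df_counts ends up with only the Neighborhood column (as after the drop call in the question), count() has nothing left to count; a sketch using size(), which counts rows regardless of which columns remain, avoids that:

# Sketch, assuming df_counts from the question with PdDistrict already renamed.
df_totals = (df_counts.groupby('Neighborhood')
                      .size()
                      .reset_index(name='Count'))
print(df_totals.to_string(index=False))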

Python - unable to count occurences of values in defined ranges in dataframe

I'm trying to write code that analyses the values in a dataframe: if a value falls in a class, the count for that class's key in the dictionary is incremented. But the code is not working for me. I'm trying to create logarithmic classes and count the total number of values that fall in each one.
def bins(df):
    """Returns new df with values assigned to bins"""
    bins_dict = {500: 0, 5000: 0, 50000: 0, 500000: 0}
    for i in df:
        if 100 < i and i <= 1000:
            bins_dict[500] += 1,
        elif 1000 < i and i <= 10000:
            bins_dict[5000] += 1
    print(bins_dict)
However, this is returning the original dictionary.
I've also tried modifying the dataframe using
def transform(df, range):
    for i in df:
        for j in range:
            b = 10**j
            while j == 1:
                while i > 100:
                    if i >= b:
                        j += 1,
                    elif i < b:
                        b = b/2,
                        print(i = b*(int(i/b)))
This code is returning the original dataframe.
My dataframe consists of only one column with values ranging between 100 and 10000000
Data Sample:
Area
0 1815
1 907
2 1815
3 907
4 907
Expected output
dict={500:3, 5000:2, 50000:0}
If I can get a dataframe output directly, that would be helpful too.
P.S. I am very new to programming and I only know Python.
You need to use pandas for it:
import pandas as pd
df = pd.DataFrame()
df['Area'] = [1815, 907, 1815, 907, 907]
# create new column to categorize your data
df['bins'] = pd.cut(df['Area'], [0,1000,10000,100000], labels=['500', '5000', '50000'])
# converting into dictionary
dic = dict(df['bins'].value_counts())
print(dic)
Output:
{'500': 3, '5000': 2, '50000': 0}
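Since the question also asks for a dataframe output, a small sketch (column names are illustrative) building on the same bins column could be:

# Sketch: turn the per-bin counts into a DataFrame instead of a dict.
counts_df = (df['bins'].value_counts()
                       .rename_axis('bin')
                       .reset_index(name='count'))
print(counts_df)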

keep unique words in a pandas dataframe row

Dataframe:
>>> type(df)
pandas.core.frame.DataFrame
>>> df
     ID  Property Type                               Amenities
1952043  Apartment, Villa, Apartment                 Park, Jogging Track, Park
1918916  Bungalow, Cottage House, Cottage, Bungalow  Garden, Play Ground
How can I keep just the unique words, separated by commas, in each row of the dataframe? In this case it must not consider "Cottage House" and "Cottage" the same. It must check this for all columns of the dataframe, so my desired output should look like below:
Desired Output :
     ID  Property Type                     Amenities
1952043  Apartment, Villa                  Park, Jogging Track
1918916  Bungalow, Cottage House, Cottage  Garden, Play Ground
First, I create a function that does what you want for a given string. Secondly, I apply this function to all strings in the column.
import numpy as np
import pandas as pd

df = pd.DataFrame([['Apartment, Villa, Apartment',
                    'Park, Jogging Track, Park'],
                   ['Bungalow, Cottage House, Cottage, Bungalow',
                    'Garden, Play Ground']],
                  columns=['Property Type', 'Amenities'])

def drop_duplicates(row):
    # Split string by ', ', drop duplicates and join back.
    words = row.split(', ')
    return ', '.join(np.unique(words).tolist())

# drop_duplicates is applied to all rows of df.
df['Property Type'] = df['Property Type'].apply(drop_duplicates)
df['Amenities'] = df['Amenities'].apply(drop_duplicates)
print(df)
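Note that np.unique sorts the words alphabetically, so the original word order is not preserved. If order matters (as in the desired output above), a sketch using dict.fromkeys keeps first-occurrence order instead:

def drop_duplicates_keep_order(row):
    # dict.fromkeys removes duplicates while preserving insertion order (Python 3.7+).
    return ', '.join(dict.fromkeys(row.split(', ')))

# Use this in place of drop_duplicates above if the original word order matters.
df['Property Type'] = df['Property Type'].apply(drop_duplicates_keep_order)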
The main idea is to:
1. iterate through every row,
2. split the string in the target column by ', ',
3. return the unique set() of the list from step 2.
Code:
>>> proptype_column = df['Property Type']
>>> for row in proptype_column:                   # Step 1.
...     items_in_row = row.split(', ')            # Step 2.
...     uniq_items_in_row = set(row.split(', '))  # Step 3.
...     print(uniq_items_in_row)
...
set(['Apartment', 'Villa'])
set(['Cottage', 'Bungalow', 'Cottage House'])
Now you can achieve the same with DataFrame.apply() function:
>>> import pandas as pd
>>> df = pd.read_csv('test.txt', sep='\t')
>>> df['Property Type'].apply(lambda cell: set([c.strip() for c in cell.split(',')]))
0 {Apartment, Villa}
1 {Cottage, Bungalow, Cottage House}
Name: Property Type, dtype: object
>>> proptype_uniq = df['Property Type'].apply(lambda cell: set(cell.split(', ')))
>>> df['Property Type (Unique)'] = proptype_uniq
>>> df
ID Property Type \
0 12345 Apartment, Villa, Apartment
1 67890 Bungalow, Cottage House, Cottage, Bungalow
Amenities Property Type (Unique)
0 Park, Jogging Track, Park {Apartment, Villa}
1 Garden, Play Ground {Cottage, Bungalow, Cottage House}
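Since the question wants every column handled, a short sketch (reusing the drop_duplicates helper defined in the first answer) that applies it to all string columns at once:

# Sketch: apply the helper to every object (string) column of df; numeric
# columns such as ID are left untouched by select_dtypes.
text_cols = df.select_dtypes(include='object').columns
df[text_cols] = df[text_cols].apply(lambda col: col.map(drop_duplicates))
print(df)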
