Using FuzzyWuzzy with pandas

I am trying to calculate the similarity between the cities in my dataframe and one static city name. (Eventually I want to iterate through a dataframe and choose the best-matching city name from that dataframe, but I am testing my code on this simplified scenario.)
I am using fuzzywuzzy's token set ratio.
For some reason it calculates the first row correctly, but then it seems to assign that same value to all rows.
code:
import pandas as pd
from fuzzywuzzy import fuzz

test_df = pd.DataFrame({"City": ["Amsterdam", "Amsterdam", "Rotterdam", "Zurich", "Vienna", "Prague"]})
test_df = test_df.assign(Score = lambda d: fuzz.token_set_ratio("amsterdam", test_df["City"]))
print (test_df.shape)
test_df.head()
Result:
City Score
0 Amsterdam 100
1 Amsterdam 100
2 Rotterdam 100
3 Zurich 100
4 Vienna 100
If I do the comparison one by one it works:
print (fuzz.token_set_ratio("amsterdam","Amsterdam"))
print (fuzz.token_set_ratio("amsterdam","Rotterdam"))
print (fuzz.token_set_ratio("amsterdam","Zurich"))
print (fuzz.token_set_ratio("amsterdam","Vienna"))
Results:
100
67
13
13
Thank you in advance!

I managed to solve it by iterating through the rows:
for index, row in test_df.iterrows():
    test_df.loc[index, "Score"] = fuzz.token_set_ratio("amsterdam", test_df.loc[index, "City"])
The result is:
City Country Code Score
0 Amsterdam NL 100
1 Amsterdam NL 100
2 Rotterdam NL 67
3 Zurich NL 13
4 Vienna NL 13
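For reference, the original assign most likely returned 100 everywhere because fuzz.token_set_ratio was handed the whole City column (a Series) rather than each individual value, so a single scalar score was computed and then broadcast to every row. A vectorized alternative to the iterrows loop, as a minimal sketch on the same test frame:

import pandas as pd
from fuzzywuzzy import fuzz

test_df = pd.DataFrame({"City": ["Amsterdam", "Amsterdam", "Rotterdam", "Zurich", "Vienna", "Prague"]})

# apply() passes each city string to token_set_ratio individually.
test_df["Score"] = test_df["City"].apply(lambda c: fuzz.token_set_ratio("amsterdam", c))
print(test_df)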

Related

How to merge back k-means clustering outputs to the corresponding units in a dataframe

I just wonder whether this strategy is the correct way to merge the k-means clustering outputs back to the corresponding units in the existing dataframe.
For example, I have a data set which includes user ID, age, income and gender, and I want to run a k-means clustering algorithm to find a set of clusters where each cluster has similar users in terms of these characteristics (age, income, gender). Note that I disregard the value differences among the characteristics for brevity.
existing_dataframe
user_id age income gender
1 13 10 1 (female)
2 34 50 1
3 75 40 0 (male)
4 23 29 0
5 80 45 1
... ... ... ...
existing_dataframe_for_analysis
(Based on my understanding after referring to a number of tutorials from online sources,
I should not include the user_id variable, so I use the below dataframe for the analysis;
please let me know if I am wrong)
age income gender
13 10 1 (female)
34 50 1
75 40 0 (male)
23 29 0
80 45 1
... ... ...
Assume that I found that the optimal number of clusters for the dataset is 3. So I decided to set it to 3 and predict which cluster each user falls into using the code below.
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3,
               init='k-means++',
               max_iter=20,
               n_init=10)
model.fit(existing_dataframe_for_analysis)
predicted = model.predict(existing_dataframe_for_analysis)
print(predicted[:5])
The expected output is shown below:
[0 1 2 1 2]
If I run the below code, where I create a new column called 'cluster' that holds the analysis outputs and add that column to the existing dataframe, does it guarantee that the nth element of the output list corresponds to the nth observation (user id) in the existing dataframe? Please advise.
existing_dataframe['cluster']=predicted
print (existing_dataframe)
output:
user_id age income gender cluster
1 13 10 1 (female) 0
2 34 50 1 1
3 75 40 0 (male) 2
4 23 29 0 1
5 80 45 1 2
... ... ... ... ...
Your approach of rejoining the predictions is correct. Your assumption not to include any ids is also correct. However, I strongly advise you to scale your input variables before doing any clustering, as your variables have different units.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(existing_dataframe_for_analysis)
Then continue working with this new object as you did before.
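Putting the pieces together, a minimal sketch of the scaled pipeline, reusing the frames defined in the question (the predictions keep the original row order, so assigning them back lines up with user_id):

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Scale so that age, income and gender contribute comparably to the distances.
scaler = StandardScaler()
scaled_features = scaler.fit_transform(existing_dataframe_for_analysis)

# fit_predict returns one label per row, in the same order as the input rows.
model = KMeans(n_clusters=3, init='k-means++', max_iter=20, n_init=10)
predicted = model.fit_predict(scaled_features)

# The nth label corresponds to the nth row of the original dataframe.
existing_dataframe['cluster'] = predicted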

pandas: search column values from one df in another df column that contains lists

I need to search the values from the df1['numsearch'] column into the lists in df2['Numbers']. If the number is in those lists, then I want to add values from the df2['Score'] column to df1. See desired output below.
df1 = pd.DataFrame(
    {'Day': ['M', 'Tu', 'W', 'Th', 'Fr', 'Sa', 'Su'],
     'numsearch': ['1', '20', '14', '99', '19', '6', '101']
     })
df2 = pd.DataFrame(
    {'Letters': ['a', 'b', 'c', 'd'],
     'Numbers': [['1', '2', '3', '4'], ['5', '6', '7', '8'], ['10', '20', '30', '40'], ['11', '12', '13', '14']],
     'Score': ['1.1', '2.2', '3.3', '4.4']})
desired output
Day numsearch Score
0 M 1 1.1
1 Tu 20 3.3
2 W 14 4.4
3 Th 99 "No score"
4 Fr 19 "No score"
5 Sa 6 2.2
6 Su 101 "No score"
I have written a for loop that works with the test data.
import numpy as np

scores = []
for s, ns in enumerate(ppr_data['SN']):
    match = ''
    for k, q in enumerate(jcr_data['All_ISSNs']):
        if ns in q:
            scores.append(jcr_data['Journal Impact Factor'][k])
            match = 1
        else:
            continue
    if match == "":
        scores.append('No score')
    match = ""
df1['Score'] = np.array(scores)
In my small test the above code works, but when working with larger data files it creates duplicates. So this clearly isn't the best way to do this.
I'm sure there's a more pandas-proper line of code that ends in .fillna("No score") .
I tried to use a loc statement, but I get hung up on searching the values of one dataframe in a column that contains lists.
Can anyone shed some light?
df2 = df2.explode('Numbers')  # Explode df2 on Numbers
d = dict(zip(df2.Numbers, df2.Score))  # dict of Numbers to Scores
df1['Score'] = df1.numsearch.map(d).fillna('No Score')  # Map dict to df1, filling NaN with No Score
You can shorten it as follows:
df2 = df2.explode('Numbers')  # Explode df2 on Numbers
df1['Score'] = df1.numsearch.map(dict(zip(df2.Numbers, df2.Score))).fillna('No Score')
Day numsearch Score
0 M 1 1.1
1 Tu 20 3.3
2 W 14 4.4
3 Th 99 No Score
4 Fr 19 No Score
5 Sa 6 2.2
6 Su 101 No Score
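One caveat worth noting: this works because numsearch and the exploded Numbers are both strings here. If one side held integers, map would silently find no matches; aligning the dtypes first avoids that (a hypothetical adjustment, not part of the answer above):

# Align dtypes before mapping, in case numsearch were numeric while Numbers are strings.
lookup = dict(zip(df2.Numbers.astype(str), df2.Score))
df1['Score'] = df1.numsearch.astype(str).map(lookup).fillna('No Score')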
You can try a left join and fillna:
df1.merge(df2.explode('Numbers'),
          left_on='numsearch',
          right_on='Numbers',
          how='left')[['Day', 'numsearch', 'Score']].fillna("No score")
Output:
Day numsearch Score
0 M 1 1.1
1 Tu 20 3.3
2 W 14 4.4
3 Th 99 No score
4 Fr 19 No score
5 Sa 6 2.2
6 Su 101 No score
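A small caveat on the merge approach: if a number appeared in more than one Numbers list, the left join would return one row per match. A hypothetical safeguard (not needed for the example data) is to drop duplicates on the key columns:

out = df1.merge(df2.explode('Numbers'), left_on='numsearch', right_on='Numbers', how='left')
# Keep one score per Day/numsearch pair in case a number matched several lists.
out = out[['Day', 'numsearch', 'Score']].drop_duplicates(subset=['Day', 'numsearch']).fillna("No score")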

pandas long (very long + index) to wide format conversion

I have a single column within a dataframe which comprises both the index (virus) and the data to tabulate, and I wish to convert it to wide format.
Input data
virus1
AGCTGAGTGAG # sequence
40.1 # score 1
23 # score 2
102 # score 3
virus2
AGCTGAGTGAG # sequence
43.4 # score 1
32 # score 2
101 # score 3
virus3
AGTTGAGTGAG # sequence
41.3 # score 1
35 # score 2
100 # score 3
.... >100 inputs
Dataframe output
        sequence     score1  score2  score3
virus1  AGCTGAGTGAG  40.1    23      102
virus2  AGCTGAGTGAG  43.4    32      101
virus3  AGTTGAGTGAG  41.3    35      100
I attempted to import the data into a single dataframe and move the rows into columns of a new dataframe
Code
df = pd.read_csv(file, sep='\n', header=None)
index_labels = df.iloc[::5].astype(str)
dfvirus = pd.DataFrame(index=index_labels)
dfvirus['sequence'] = df.iloc[1::5].astype(str)
dfvirus['score1'] = df.iloc[2::5].astype(float)
dfvirus['score2'] = df.iloc[3::5].astype(int)
dfvirus['score3'] = df.iloc[4::5].astype(int)
The above didn't work: I get NaN or nan for the values of e.g. dfvirus['sequence'].head(), depending on whether the input is a number or a string. I could do this by constructing a hierarchical index, but that would mean looping a very long index into a list.
Moving from long to wide format is a common issue, and I would be grateful if you could show a simpler solution or point out where I'm going wrong here.
You can do:
df = pd.read_csv(file, sep='\n', header=None)
new_df = pd.DataFrame(df.values.reshape(-1, 5),
                      columns=['virus', 'sequence', 'score1', 'score2', 'score3'])
Output
virus sequence score1 score2 score3
0 virus1 AGCTGAGTGAG 40.1 23 102
1 virus2 AGCTGAGTGAG 43.4 32 101
2 virus3 AGTTGAGTGAG 41.3 35 100
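Because the reshaped values come from one mixed-type column, the new columns are typically object dtype. A small optional follow-up (not part of the original answer) is to promote virus to the index and convert the scores to numbers:

# Promote the virus names to the index and make the score columns numeric.
new_df = new_df.set_index('virus')
new_df[['score1', 'score2', 'score3']] = new_df[['score1', 'score2', 'score3']].apply(pd.to_numeric)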

Pandas dataframe not correct format for groupby, what is wrong?

I am trying to sum all columns based on the value of the first, but groupby.sum is unexpectedly not working.
Here is a minimal example:
import pandas as pd
data = [['Alex',10, 11],['Bob',12, 10],['Clarke',13, 9], ['Clarke',1, 1]]
df = pd.DataFrame(data,columns=['Name','points1', 'points2'])
print(df)
df.groupby('Name').sum()
print(df)
I get this:
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 13 9
3 Clarke 1 1
And not this:
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 14 10
From what I understand, the dataframe is not in the right format for pandas to perform the group by. I would like to understand what is wrong with it, because this is just a toy example but I have the same problem with a real dataset.
The real data I'm trying to read is the Johns Hopkins University Covid-19 dataset:
https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series
You forgot to assign the output of the aggregation to a variable, because the aggregation does not work in place. So in your solution, print(df) before and after the groupby prints the same original DataFrame.
df1 = df.groupby('Name', as_index=False).sum()
print (df1)
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 14 10
Or you can set to same variable df:
df = df.groupby('Name', as_index=False).sum()
print (df)
Name points1 points2
0 Alex 10 11
1 Bob 12 10
2 Clarke 14 10
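Applied to the Johns Hopkins time-series files linked in the question, the same fix holds: assign the grouped sum back to a variable. A minimal sketch on a toy frame shaped like those CSVs (the Province/State and Country/Region column names come from that repository; the numbers are illustrative):

import pandas as pd

# Toy frame shaped like the JHU time-series CSVs: one row per province/state.
covid = pd.DataFrame({
    'Province/State': ['British Columbia', 'Ontario', None],
    'Country/Region': ['Canada', 'Canada', 'Italy'],
    '1/22/20': [0, 0, 0],
    '1/23/20': [1, 2, 0],
})

# groupby().sum() returns a new frame; it never modifies covid in place.
by_country = covid.groupby('Country/Region', as_index=False).sum(numeric_only=True)
print(by_country)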

Is there a Python function to sum all of the columns of a particular row? If not, what would be the best way to go about this? [duplicate]

I'm going through the Khan Academy course on Statistics as a bit of a refresher from my college days, and as a way to get me up to speed on pandas & other scientific Python.
I've got a table that looks like this from Khan Academy:
| Undergraduate | Graduate | Total
-------------+---------------+----------+------
Straight A's | 240 | 60 | 300
-------------+---------------+----------+------
Not | 3,760 | 440 | 4,200
-------------+---------------+----------+------
Total | 4,000 | 500 | 4,500
I would like to recreate this table using pandas. Of course I could create a DataFrame using something like
"Graduate": {...},
"Undergraduate": {...},
"Total": {...},
But that seems like a naive approach that would both fall over quickly and just not really be extensible.
I've got the non-totals part of the table like this:
df = pd.DataFrame(
{
"Undergraduate": {"Straight A's": 240, "Not": 3_760},
"Graduate": {"Straight A's": 60, "Not": 440},
}
)
df
I've been looking and found a couple of promising things, like:
df['Total'] = df.sum(axis=1)
But I didn't find anything terribly elegant.
I did find the crosstab function that looks like it should do what I want, but it seems like in order to do that I'd have to create a dataframe consisting of 1/0 for all of these values, which seems silly because I've already got an aggregate.
I have found some approaches that seem to manually build a new totals row, but it seems like there should be a better way, something like:
totals(df, rows=True, columns=True)
or something.
Does this exist in pandas, or do I have to just cobble together my own approach?
Or in two steps, using the .sum() function as you suggested (which might be a bit more readable as well):
import pandas as pd
df = pd.DataFrame( {"Undergraduate": {"Straight A's": 240, "Not": 3_760},"Graduate": {"Straight A's": 60, "Not": 440},})
#Total sum per column:
df.loc['Total',:] = df.sum(axis=0)
#Total sum per row:
df.loc[:,'Total'] = df.sum(axis=1)
Output:
Graduate Undergraduate Total
Not 440 3760 4200
Straight A's 60 240 300
Total 500 4000 4500
append and assign
The point of this answer is to provide an in-line and not an in-place solution.
append
I use append to stack a Series or DataFrame vertically. It also creates a copy so that I can continue to chain.
assign
I use assign to add a column. However, the DataFrame I'm working on is the intermediate result of the chain, so I use a lambda in the assign argument, which tells Pandas to apply it to the calling DataFrame.
df.append(df.sum().rename('Total')).assign(Total=lambda d: d.sum(1))
Graduate Undergraduate Total
Not 440 3760 4200
Straight A's 60 240 300
Total 500 4000 4500
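Note that DataFrame.append has been deprecated and removed in recent pandas releases; an equivalent one-chain version with pd.concat, as a sketch under that assumption:

import pandas as pd

df = pd.DataFrame({"Undergraduate": {"Straight A's": 240, "Not": 3_760},
                   "Graduate": {"Straight A's": 60, "Not": 440}})

# Stack the column totals as a "Total" row, then add the "Total" column in the same chain.
out = (pd.concat([df, df.sum().rename('Total').to_frame().T])
         .assign(Total=lambda d: d.sum(axis=1)))
print(out)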
Fun alternative
Uses drop with errors='ignore' to get rid of potentially pre-existing Total rows and columns.
Also, still in line.
def tc(d):
    return d.assign(Total=d.drop('Total', errors='ignore', axis=1).sum(1))

df.pipe(tc).T.pipe(tc).T
Graduate Undergraduate Total
Not 440 3760 4200
Straight A's 60 240 300
Total 500 4000 4500
From the original data, using crosstab: based on just your input, you only need melt before crosstab.
s = df.reset_index().melt('index')
pd.crosstab(index=s['index'], columns=s.variable, values=s.value, aggfunc='sum', margins=True)
Out[33]:
variable Graduate Undergraduate All
index
Not 440 3760 4200
Straight A's 60 240 300
All 500 4000 4500
Toy data
df=pd.DataFrame({'c1':[1,2,2,3,4],'c2':[2,2,3,3,3],'c3':[1,2,3,4,5]})
# before `agg`, I think your input is the result after `groupby`
df
Out[37]:
c1 c2 c3
0 1 2 1
1 2 2 2
2 2 3 3
3 3 3 4
4 4 3 5
pd.crosstab(df.c1, df.c2, df.c3, aggfunc='sum', margins=True)
Out[38]:
c2 2 3 All
c1
1 1.0 NaN 1
2 2.0 3.0 5
3 NaN 4.0 4
4 NaN 5.0 5
All 3.0 12.0 15
The original data is:
>>> df = pd.DataFrame(dict(Undergraduate=[240, 3760], Graduate=[60, 440]), index=["Straight A's", "Not"])
>>> df
Out:
Graduate Undergraduate
Straight A's 60 240
Not 440 3760
You can simply use df.T to recreate this table (transposed):
>>> df_new = df.T
>>> df_new
Out:
Straight A's Not
Graduate 60 440
Undergraduate 240 3760
Then compute the totals by row and by column:
>>> df_new.loc['Total',:]= df_new.sum(axis=0)
>>> df_new.loc[:,'Total'] = df_new.sum(axis=1)
>>> df_new
Out:
Straight A's Not Total
Graduate 60.0 440.0 500.0
Undergraduate 240.0 3760.0 4000.0
Total 300.0 4200.0 4500.0
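As a small aside (not part of the original answer), the .loc assignments upcast the counts to float, which is why the output above shows 60.0 rather than 60; if integer totals are preferred, a final cast restores them:

# Optional: restore integer dtypes after adding the total row and column.
df_new = df_new.astype(int)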
