How to categorize positive and negative features from top features - python-3.x

I have trained user reviews through an averaged TF-IDF word2vec model and extracted the top features. I would like to tag the top features as positive or negative. Could you please suggest how?
import numpy as np
import pandas as pd

def top_tfidf_feats(row, features, top_n=1):
    '''Get the top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    return pd.DataFrame(top_feats, columns=['feature', 'tfidf'])

top_tfidf = top_tfidf_feats(final_tf_idf[1, :].toarray()[0], tfidf_feat, 10)
Top 10 features...
feature tfidf
------- ------
0 urgent 0.513783
1 tells 0.501945
2 says 0.490708
3 clear 0.424756
4 care 0.206723
5 not 0.141886
6 flanum 0.000000
7 flap 0.000000
8 flare 0.000000
9 flared 0.000000
10 flares 0.000000
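TF-IDF weights carry no polarity of their own, so one way to tag the top features is to check each one against a sentiment lexicon. A minimal sketch, assuming NLTK's opinion_lexicon is an acceptable word list (any positive/negative lexicon slots in the same way):
import nltk
nltk.download('opinion_lexicon')  # one-time download
from nltk.corpus import opinion_lexicon

pos_words = set(opinion_lexicon.positive())
neg_words = set(opinion_lexicon.negative())

def tag_sentiment(word):
    # Label a feature by lexicon membership; unknown words stay neutral.
    if word in pos_words:
        return 'positive'
    if word in neg_words:
        return 'negative'
    return 'neutral'

top_tfidf['sentiment'] = top_tfidf['feature'].apply(tag_sentiment)
Note that a plain lookup can't handle negation, so a feature like "not" will land wherever the lexicon puts it rather than flipping the sentiment of its neighbours.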

Related

How to compare values in a data frame in pandas [duplicate]

I am trying to calculate the biggest difference between summer gold medal counts and winter gold medal counts relative to their total gold medal count. The problem is that I need to consider only countries that have won at least 1 gold medal in both summer and winter.
Gold: Count of summer gold medals
Gold.1: Count of winter gold medals
Gold.2: Total Gold
This a sample of my data:
Gold Gold.1 Gold.2 ID diff gold %
Afghanistan 0 0 0 AFG NaN
Algeria 5 0 5 ALG 1.000000
Argentina 18 0 18 ARG 1.000000
Armenia 1 0 1 ARM 1.000000
Australasia 3 0 3 ANZ 1.000000
Australia 139 5 144 AUS 0.930556
Austria 18 59 77 AUT 0.532468
Azerbaijan 6 0 6 AZE 1.000000
Bahamas 5 0 5 BAH 1.000000
Bahrain 0 0 0 BRN NaN
Barbados 0 0 0 BAR NaN
Belarus 12 6 18 BLR 0.333333
This is the code that I have but it is giving the wrong answer:
def answer():
    Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
    df2['difference'] = (df2['Gold'] - df2['Gold.1']).abs() / df2['Gold.2']
    return df2['diff gold %'].idxmax()

answer()
Try this code after substituting in your own function and variable names. I'm new to Python, but I think the issue was that the return has to use the same column you assigned (df1['difference']), with the method (.idxmax()) added to the end. I don't think you need the first line of the function either, since the local variable (Gold_Y) is never used. FYI, I don't think we're working with the same dataset.
def answer_three():
    df1['difference'] = (df1['Gold'] - df1['Gold.1']).abs() / df1['Gold.2']
    return df1['difference'].idxmax()

answer_three()
def answer_three():
    # "at least one gold" means > 0, not > 1
    atleast_one_gold = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    return ((atleast_one_gold['Gold'] - atleast_one_gold['Gold.1']) / atleast_one_gold['Gold.2']).idxmax()

answer_three()
def answer_three():
    _df = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    # idxmax returns the index label; Series.argmax is positional in recent pandas
    return ((_df['Gold'] - _df['Gold.1']) / _df['Gold.2']).idxmax()

answer_three()
This looks like a question from the programming assignment of the Coursera course "Introduction to Data Science in Python". Having said that, if you are not cheating, "maybe" the bug is here:
Gold_Y = df2[(df2['Gold'] > 1) | (df2['Gold.1'] > 1)]
You should use the & operator. The | operator means you get countries that have won gold in either the Summer or the Winter Olympics. You should not get a NaN in your diff gold %.
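For instance, the corrected filter might look like this (a sketch reusing the question's own names; note > 0, since the requirement is at least one gold in each season):
Gold_Y = df2[(df2['Gold'] > 0) & (df2['Gold.1'] > 0)].copy()  # gold in BOTH seasons
Gold_Y['diff gold %'] = (Gold_Y['Gold'] - Gold_Y['Gold.1']).abs() / Gold_Y['Gold.2']
Gold_Y['diff gold %'].idxmax()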
def answer_three():
    diff = df['Gold'] - df['Gold.1']
    relativegold = diff.abs() / df['Gold.2']
    df['relativegold'] = relativegold
    x = df[(df['Gold.1'] > 0) & (df['Gold'] > 0)]
    return x['relativegold'].idxmax(axis=0)

answer_three()
I am pretty new to Python and to programming as a whole, so my solution is about as novice as it gets! I love to create variables, so you'll see a lot of them in the solution.
def answer_three():
    # Boolean masking: keep only the 'Gold' values for countries with at
    # least one gold medal in the summer Olympics.
    a = df.loc[df['Gold'] > 0, 'Gold']
    # Same as above, but 'Gold.1' holds the winter gold medals.
    b = df.loc[df['Gold.1'] > 0, 'Gold.1']
    # Absolute value of the difference between a and b.
    dif = abs(a - b)
    # Totals; I only realised later this step wasn't essential, because the
    # data frame already holds the sum in 'Gold.2'.
    tots = a + b
    # dropna() is not in-place, so apply it where the result is used.
    result = dif.dropna() / tots.dropna()
    # Return the index label of the max result.
    return result.idxmax()
def answer_two():
    df2 = pd.Series.max(df['Gold'] - df['Gold.1'])
    df2 = df[df['Gold'] - df['Gold.1'] == df2]
    return df2.index[0]

answer_two()
def answer_three():
    both = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)]
    # idxmax returns the country label; argmax is positional in recent pandas
    return ((both['Gold'] - both['Gold.1']) / both['Gold.2']).idxmax()

groupby and ranking based on the string in one column

I am working on a data frame that contains over 70 actions. I have a column that groups those actions. I want to create a new column that ranks the strings from an existing column. The following is a sample of the data frame:
import pandas as pd

DF = pd.DataFrame()
DF['template'] = ['Attk'] * 6 + ['Def'] * 6 + ['Accuracy'] * 6
DF['Stats'] = (['Goal', 'xG', 'xA'] * 2
               + ['Block', 'interception', 'tackles'] * 2
               + ['Acc.passes', 'Acc.actions', 'Acc.crosses'] * 2)
DF = DF.sort_values(['template', 'Stats'])
The new column I want to create should group by [template] and rank the Stats in alphabetical order. The expected data frame adds an Order column holding that rank; there are 10 to 15 Stats under each template.
Use GroupBy.transform with a lambda function and factorize; because Python counts from 0, 1 is added:
f = lambda x: pd.factorize(x)[0]
DF['Order'] = DF.groupby('template')['Stats'].transform(f) + 1
print(DF)
template Stats Order
13 Accuracy Acc.actions 1
16 Accuracy Acc.actions 1
14 Accuracy Acc.crosses 2
17 Accuracy Acc.crosses 2
12 Accuracy Acc.passes 3
15 Accuracy Acc.passes 3
0 Attk Goal 1
3 Attk Goal 1
2 Attk xA 2
5 Attk xA 2
1 Attk xG 3
4 Attk xG 3
6 Def Block 1
9 Def Block 1
7 Def interception 2
10 Def interception 2
8 Def tackles 3
11 Def tackles 3
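A possible alternative (a sketch, not from the answer above): Series.rank with method='dense' produces the same 1-based alphabetical ordering without relying on the frame being pre-sorted; factorize numbers values in order of appearance, which matches alphabetical order here only because of the sort_values call.
DF['Order'] = (DF.groupby('template')['Stats']
                 .transform(lambda x: x.rank(method='dense'))
                 .astype(int))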

how to compare strings in two data frames in pandas (python 3.x)?

I have two DFs like so:
df1:
ProjectCode ProjectName
1 project1
2 project2
3 projc3
4 prj4
5 prjct5
and df2 as
VillageName
v1
proj3
pro1
prjc3
project1
What I have to do is compare each ProjectName with each VillageName and add the percentage of matching, calculated as:
number of matching characters / total characters * 100
The village data (df2) has more than 10 million records, while the project data (df1) contains around 1,200 records.
What I have done so far:
import pandas as pd

df1 = pd.read_excel("C:\\Users\\Desktop\\distinctVillage.xlsx")
df = pd.read_excel("C:\\Users\\Desktop\\awcProjectMaster.xlsx")

for idx, row in df.iterrows():  # iterrows, not iteritems: we want rows, not columns
    for idx1, row1 in df1.iterrows():
        ...  # stuck here
I don't know how to proceed from here: how to find substrings and build a third df with the percentage match for each pair. I suspect it may not even be feasible, since matching every Project record against every Village value will produce a huge result.
Is there any better way to find out which project names are matching with which village names and how good is the match?
Expected output:
ProjectName VillageName charactersMatching PercentageMatch
project1 v1 1 whateverPercent
project1 proj3 4 whateverPercent
The expected output can be changed depending on the feasibility and solution.
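A direct reading of that specification, as a sketch (set-based character overlap, which matches the example where 'project1' vs 'proj3' share the 4 characters p, r, o, j; the nested loop is O(projects x villages), so at 10 million villages you would want to vectorize or block it):
rows = []
for p in df1['ProjectName']:
    for v in df2['VillageName']:
        common = len(set(p) & set(v))              # distinct shared characters
        pct = common / len(set(p) | set(v)) * 100  # relative to all distinct characters
        rows.append((p, v, common, pct))
result = pd.DataFrame(rows, columns=['ProjectName', 'VillageName',
                                     'charactersMatching', 'PercentageMatch'])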
The following code assumes you don't care about repeated characters (since it's taking the set on both sides).
percentage_match = df1['ProjectName'].apply(
    lambda x: df2['VillageName'].apply(
        lambda y: len(set(y).intersection(set(x))) / len(set(x + y))))
Output:
0 1 2 3 4
ProjectCode
1 0.111111 0.444444 0.500000 0.444444 1.000000
2 0.000000 0.444444 0.333333 0.444444 0.777778
3 0.000000 0.833333 0.428571 0.833333 0.555556
4 0.000000 0.500000 0.333333 0.500000 0.333333
5 0.000000 0.375000 0.250000 0.571429 0.555556
If you want the 'best match' for each Project:
percentage_match.idxmax(axis = 1)
Output:
1 4
2 4
3 1
4 1
5 3
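If the order of characters should matter (the set-based measure above ignores it), difflib from the standard library is one hedged alternative:
from difflib import SequenceMatcher

def match_pct(a, b):
    # Share of characters that match in sequence, scaled to a percentage.
    return SequenceMatcher(None, a, b).ratio() * 100

print(match_pct('project1', 'prjc3'))  # roughly 61.5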

Merge distance matrix results and original indices with Python Pandas

I have a pandas df with a list of bus stops and their geolocations:
stop_id stop_lat stop_lon
0 1 32.183939 34.917812
1 2 31.870034 34.819541
2 3 31.984553 34.782828
3 4 31.888550 34.790904
4 6 31.956576 34.898125
stop_id isn't necessarily incremental.
Using sklearn.metrics.pairwise.manhattan_distances I calculate distances and get a symmetric distance matrix:
array([[0. , 1.412176, 2.33437 , 3.422297, 5.24705 ],
[1.412176, 0. , 1.151232, 2.047153, 4.165126],
[2.33437 , 1.151232, 0. , 1.104079, 3.143274],
[3.422297, 2.047153, 1.104079, 0. , 2.175247],
[5.24705 , 4.165126, 3.143274, 2.175247, 0. ]])
But now I can't manage to easily connect the two. I want a df that contains a tuple for each pair of stops and their distance, something like:
stop_id_1 stop_id_2 distance
1 2 3.33
I tried working with the lower triangle, converting to a vector, and all sorts of other things, but I feel I'm just over-complicating it, with no success.
Hope this helps!
from io import StringIO  # pd.compat.StringIO was removed in newer pandas
import pandas as pd
from sklearn.metrics.pairwise import manhattan_distances

d = ''' stop_id stop_lat stop_lon
0 1 32.183939 34.917812
1 2 31.870034 34.819541
2 3 31.984553 34.782828
3 4 31.888550 34.790904
4 6 31.956576 34.898125 '''
df = pd.read_csv(StringIO(d), sep=r'\s+')

# Note: this feeds stop_id into the distance too (which is how the question's
# matrix was produced); use df[['stop_lat', 'stop_lon']] for pure geo distance.
distance_df = pd.DataFrame(manhattan_distances(df))
distance_df.index = df.stop_id.values
distance_df.columns = df.stop_id.values
print(distance_df)
output:
1 2 3 4 6
1 0.000000 1.412176 2.334370 3.422297 5.247050
2 1.412176 0.000000 1.151232 2.047153 4.165126
3 2.334370 1.151232 0.000000 1.104079 3.143274
4 3.422297 2.047153 1.104079 0.000000 2.175247
6 5.247050 4.165126 3.143274 2.175247 0.000000
Now, to create the long format of the same df, use the following.
long_frmt_dist = distance_df.unstack().reset_index()
long_frmt_dist.columns = ['stop_id_1', 'stop_id_2', 'distance']
print(long_frmt_dist.head())
output:
stop_id_1 stop_id_2 distance
0 1 1 0.000000
1 1 2 1.412176
2 1 3 2.334370
3 1 4 3.422297
4 1 6 5.247050
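A small follow-up sketch (assuming unordered pairs are wanted, as the question's "tuple for each pair of stops" suggests): keep each pair once and drop the zero self-distances.
pairs = long_frmt_dist[long_frmt_dist['stop_id_1'] < long_frmt_dist['stop_id_2']]
print(pairs.head())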
Alternatively, build a DataFrame from the distance matrix and merge it back onto the original frame by index:
df_dist = pd.DataFrame.from_dict(dist_matrix)
pd.merge(df, df_dist, how='left', left_index=True, right_index=True)

Reindexing multi-index dataframes in Pandas

I am trying to reindex one multi-index dataframe based on another multi-index dataframe. For singly-indexed dfs, this works:
import numpy as np
import pandas as pd

index1 = range(3, 7)
index2 = range(1, 11)
values = [np.random.random() for x in index1]
df = pd.DataFrame(values, index=index1, columns=["values"])
print(df)
print(df.reindex(index2, fill_value=0))
Output:
values
3 0.458003
4 0.945828
5 0.783369
6 0.784599
values
1 0.000000
2 0.000000
3 0.458003
4 0.945828
5 0.783369
6 0.784599
7 0.000000
8 0.000000
9 0.000000
10 0.000000
New rows are added based on index2, with their values set to 0. This is what I expect.
Now, let's try something similar for a multi-index df:
data_dict = {
    "scan": 1,
    "x": [2, 3, 5, 7, 8, 9],
    "y": [np.random.random() for x in range(1, 7)]
}
index1 = ["scan", "x"]
df = pd.DataFrame.from_dict(data_dict).set_index(index1)  # note: index1, not the undefined name index
print(df)
index2 = list(range(4, 13))
print(df.reindex(index2, level="x").fillna(0))
Output:
y
scan x
1 2 0.771531
3 0.451761
5 0.434075
7 0.135785
8 0.309137
9 0.838330
y
scan x
1 5 0.434075
7 0.135785
8 0.309137
9 0.838330
What gives? The output is different from the input: the first two rows have been dropped. But no rows for the other values, intermediate (e.g., 4) or larger (e.g., 10 and up), have been added. What am I missing?
The actual dataframes have 6 index levels and tens to hundreds of rows, but I think this code captures the problem. I spent a little time looking at df.align and df.join, and a lot of time scouring SO, but I haven't found a solution. Apologies if it's a duplicate!
Let me suggest a workaround: reindex with a level argument appears to intersect the existing rows with the new labels here rather than expanding them, so build the full target index explicitly with MultiIndex.from_product:
print(df.reindex(
    pd.MultiIndex.from_product(
        [df.index.get_level_values(0).unique(), index2],
        names=['scan', 'x'])
).fillna(0))
y
scan x
1 4 0.000000
5 0.718190
6 0.000000
7 0.612991
8 0.609323
9 0.991806
10 0.000000
11 0.000000
12 0.000000
Building on @Sergey's workaround, here's what I ended up with. I expanded the example to have more levels, more closely replicating my own data.
Generate a df:
from datetime import datetime  # needed for meas_time

data_dict = {
    "sample": "A",
    "scan": 1,
    "meas_time": datetime.now(),
    "x": [2, 3, 5, 7, 8, 9],
    "y": [np.random.random() for x in range(1, 7)]
}
index1 = ["sample", "scan", "meas_time", "x"]
df = pd.DataFrame.from_dict(data_dict).set_index(index1)
print(df)
Try to reindex:
index2 = range(4, 13)
print(df.reindex(labels=index2, level="x").fillna(0))
Implementing Sergey's workaround:
df.reindex(
    pd.MultiIndex.from_product(
        [df.index.get_level_values("sample").unique(),
         df.index.get_level_values("scan").unique(),
         df.index.get_level_values("meas_time").unique(),
         index2],
        names=["sample", "scan", "meas_time", "x"])
).fillna(0)
Notes: if .unique() isn't included, from_product takes the Cartesian product of the repeated level values, so a multiple of the dataframe is built for each level. This is likely why my kernel crashed previously: I wasn't including .unique().
This seems like really odd pandas behavior. I also found a workaround which involved chaining .reset_index().set_index("x").reindex("blah").set_index([list]). I'd really like to know why reindexing is treated the way it is.
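For reference, a minimal sketch of that chaining idea on the earlier scan/x example (my reconstruction, not the poster's exact code; filling scan by ffill/bfill is an assumption about how the new rows should be labeled):
out = (df.reset_index()
         .set_index('x')
         .reindex(index2)  # a plain reindex works once 'x' is the only index
         .fillna({'y': 0})
         .assign(scan=lambda d: d['scan'].ffill().bfill())
         .reset_index()
         .set_index(['scan', 'x']))
print(out)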
