How to merge back k-means clustering outputs to the corresponding units in a dataframe - python-3.x

I just wonder whether this strategy is the correct way to merge the k-means clustering outputs back to the corresponding units in the existing dataframe.
For example, I have a data set which includes user ID, age, income, and gender, and I want to run a k-means clustering algorithm to find a set of clusters where each cluster has similar users in terms of these characteristics (age, income, gender). Note that, for brevity, I disregard the differences in scale among the characteristics.
existing_dataframe
user_id age income gender
1 13 10 1 (female)
2 34 50 1
3 75 40 0 (male)
4 23 29 0
5 80 45 1
... ... ... ...
existing_dataframe_for_analysis
(Based on my understanding after referring to a number of online tutorials, I should not include the user_id variable, so I use the dataframe below for the analysis; please let me know if I am wrong.)
age income gender
13 10 1 (female)
34 50 1
75 40 0 (male)
23 29 0
80 45 1
... ... ...
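(For reference, a minimal sketch of how this analysis frame could be built, assuming the original frame is named existing_dataframe as above:)
# drop the identifier column so it does not influence the clustering
existing_dataframe_for_analysis = existing_dataframe.drop(columns=['user_id'])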
Assume that I found the optimal number of clusters for the dataset to be 3. So I set n_clusters to 3 and predict which cluster each user falls into using the code below.
from sklearn.cluster import KMeans

model = KMeans(n_clusters=3,
               init='k-means++',
               max_iter=20,
               n_init=10)
model.fit(existing_dataframe_for_analysis)
predicted = model.predict(existing_dataframe_for_analysis)
print(predicted[:5])
The expected output is shown below:
[0 1 2 1 2]
If I run the code below, where I create a new column called 'cluster' that holds the analysis outputs and add that column to the existing dataframe, is it guaranteed that the nth element of the output array corresponds to the nth observation (user ID) in the existing dataframe? Please advise.
existing_dataframe['cluster']=predicted
print (existing_dataframe)
output:
user_id age income gender cluster
1 13 10 1 (female) 0
2 34 50 1 1
3 75 40 0 (male) 2
4 23 29 0 1
5 80 45 1 2
... ... ... ... ...

Your approach to rejoining the predictions is correct, and your decision not to include any IDs is also correct. However, I strongly advise you to scale your input variables before doing any clustering, as your variables have different units.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(existing_dataframe_for_analysis)
Then continue working with this new object as you did before.
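For example, a minimal sketch combining the scaling step with your original code (KMeans returns its labels in the same row order as its input, so positional assignment back onto the dataframe is safe):
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# scale the features so age, income and gender contribute comparably
scaler = StandardScaler()
scaled_features = scaler.fit_transform(existing_dataframe_for_analysis)

# fit and predict on the scaled features; the i-th label corresponds to the i-th row
model = KMeans(n_clusters=3, init='k-means++', max_iter=20, n_init=10)
predicted = model.fit_predict(scaled_features)

# attach the labels back to the original dataframe by position
existing_dataframe['cluster'] = predicted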

Related

Using FuzzyWuzzy with pandas

I am trying to calculate the similarity between the cities in my dataframe and one static city name. (Eventually I want to iterate through a dataframe and choose the best-matching city name from that dataframe, but I am testing my code on this simplified scenario.)
I am using fuzzywuzzy token set ratio.
For some reason it calculates the first row correctly, but it seems to assign that same value to all rows.
code:
import pandas as pd
from fuzzywuzzy import fuzz

test_df = pd.DataFrame({"City": ["Amsterdam", "Amsterdam", "Rotterdam", "Zurich", "Vienna", "Prague"]})
test_df = test_df.assign(Score=lambda d: fuzz.token_set_ratio("amsterdam", test_df["City"]))
print (test_df.shape)
test_df.head()
Result:
City Score
0 Amsterdam 100
1 Amsterdam 100
2 Rotterdam 100
3 Zurich 100
4 Vienna 100
If I do the comparison one by one it works:
print (fuzz.token_set_ratio("amsterdam","Amsterdam"))
print (fuzz.token_set_ratio("amsterdam","Rotterdam"))
print (fuzz.token_set_ratio("amsterdam","Zurich"))
print (fuzz.token_set_ratio("amsterdam","Vienna"))
Results:
100
67
13
13
Thank you in advance!
I managed to solve it by iterating over the rows:
for index, row in test_df.iterrows():
    test_df.loc[index, "Score"] = fuzz.token_set_ratio("amsterdam", test_df.loc[index, "City"])
The result is:
City Country Code Score
0 Amsterdam NL 100
1 Amsterdam NL 100
2 Rotterdam NL 67
3 Zurich NL 13
4 Vienna NL 13
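As a side note (not part of the original question or answers), the loop can usually be avoided by applying the scorer element-wise; the original assign call most likely scored the string representation of the whole City Series, which is why every row got the same value:
# vectorised alternative: score each city string individually
test_df["Score"] = test_df["City"].apply(lambda city: fuzz.token_set_ratio("amsterdam", city))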

taking top 3 in a groupby, and lumping rest into 'other category'

I am currently doing a groupby in pandas like this:
df.groupby(['grade'])['students'].nunique()
and the result I get is this:
grade
grade 1 12
grade 2 8
grade 3 30
grade 4 2
grade 5 600
grade 6 90
Is there a way to get the output such that I see the groups of the top 3, and everything else is classified under other?
this is what I am looking for
grade
grade 3 30
grade 5 600
grade 6 90
other (3 other grades) 22
I think you can add a helper column in the df and call it something like "Grouping": label the top 3 grades with their original names, label the remaining grades as "other", and then just group by the "Grouping" column (a sketch of this idea follows below).
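A minimal sketch of that helper-column idea (assuming the raw dataframe is named df with grade and students columns as in the question):
# unique-student counts per grade, used to find the top 3
counts = df.groupby('grade')['students'].nunique()
top3 = counts.nlargest(3).index

# helper column: keep the original grade for the top 3, label the rest 'other'
df['Grouping'] = df['grade'].where(df['grade'].isin(top3), 'other')

# regroup on the helper column (note: a student appearing in several of the
# lumped grades is only counted once here)
df.groupby('Grouping')['students'].nunique()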
Can't do much without the actual input data, but if this is your starting dataframe (df) after your groupby -
grade unique
0 grade_1 12
1 grade_2 8
2 grade_3 30
3 grade_4 2
4 grade_5 600
5 grade_6 90
You can do a few more steps to get to your table -
ddf = df.nlargest(3, 'unique')
ddf = ddf.append({'grade': 'Other', 'unique':df['unique'].sum()-ddf['unique'].sum()}, ignore_index=True)
grade unique
0 grade_5 600
1 grade_6 90
2 grade_3 30
3 Other 22
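A side note (not in the original answer): DataFrame.append has since been deprecated and removed in recent pandas versions, so the same step can be written with pd.concat instead, e.g.:
other_row = pd.DataFrame([{'grade': 'Other', 'unique': df['unique'].sum() - ddf['unique'].sum()}])
ddf = pd.concat([ddf, other_row], ignore_index=True)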

pandas: get rows from one dataframe which exist in another dataframe

I have two dataframes, as follows:
df1 is
numbers
user_id
0 9154701244
1 9100913773
2 8639988041
3 8092118985
4 8143131334
5 9440609551
6 8309707235
7 8555033317
8 7095451372
9 8919206985
10 8688960416
11 9676230089
12 7036733390
13 9100914771
Its shape is (14, 1).
df2 is
user_id numbers names type duration date_time
0 9032095748 919182206378 ramesh incoming 23 233445445
1 9032095748 918919206983 suresh incoming 45 233445445
2 9032095748 919030785187 rahul incoming 45 233445445
3 9032095748 916281206641 jay incoming 67 233445445
4 jakfnka998nknk 9874654411 query incoming 25 8571228412
5 jakfnka998nknk 9874654112 form incoming 42 678565487
6 jakfnka998nknk 9848022238 json incoming 10 89547212765
7 ukajhj9417fka 9984741215 keert incoming 32 8548412664
8 ukajhj9417fka 9979501984 arun incoming 21 7541344646
9 ukajhj9417fka 95463241 paru incoming 42 945151215451
10 ukajknva939o 7864621215 hari outgoing 34 49829840920
and its shape is (10308, 6).
Here in df1, the numbers column contains a set of unique numbers. These numbers also appear in df2, where they are repeated depending on the duration. I want to get all the data from df2 for the numbers that are listed in df1.
Here is the code I've tried, but I'm not able to figure out how this can be solved using pandas.
df = pd.concat([df1, df2]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
df = df.reindex(idx)
It gives me only the unique values of the numbers column that are present in df2, but I need to get all the data, including the other columns from the second dataframe.
It would be great if anyone could help me with this. Thanks in advance.
Here is a sample dataframe I created, keeping the gist the same.
df1=pd.DataFrame({"numbers":[123,1234,12345,5421]})
df2=pd.DataFrame({"numbers":[123,1234,12345,123,123,45643],"B":[1,2,3,4,5,6],"C":[2,3,4,5,6,7]})
final_df=df2[df2.numbers.isin(df1.numbers)]
Output DataFrame: all rows of df2 whose numbers are present in df1 are returned.
numbers B C
0 123 1 2
1 1234 2 3
2 12345 3 4
3 123 4 5
4 123 5 6
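Applied to the frames from the question (a sketch, assuming the column is named numbers in both), this becomes:
# keep only the df2 rows whose number also appears in df1
result = df2[df2['numbers'].isin(df1['numbers'])]
An inner merge on numbers would behave similarly here, provided df1 has no duplicate numbers (otherwise merge would duplicate rows).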

Averaging duplicate values in a dataframe [duplicate]

I have a dataframe like this:
cluster org time
1 a 8
1 a 6
2 h 34
1 c 23
2 d 74
3 w 6
I would like to calculate the average of time per org per cluster.
Expected result:
cluster mean(time)
1 15 #=((8 + 6) / 2 + 23) / 2
2 54 #=(74 + 34) / 2
3 6
I do not know how to do it in Pandas, can anybody help?
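(For reference, not part of the original question: the example dataframe can be reproduced like this.)
import pandas as pd

df = pd.DataFrame({
    'cluster': [1, 1, 2, 1, 2, 3],
    'org': ['a', 'a', 'h', 'c', 'd', 'w'],
    'time': [8, 6, 34, 23, 74, 6],
})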
If you want to first take mean on the combination of ['cluster', 'org'] and then take mean on cluster groups, you can use:
In [59]: (df.groupby(['cluster', 'org'], as_index=False).mean()
.groupby('cluster')['time'].mean())
Out[59]:
cluster
1 15
2 54
3 6
Name: time, dtype: int64
If you want the mean of cluster groups only, then you can use:
In [58]: df.groupby(['cluster']).mean()
Out[58]:
time
cluster
1 12.333333
2 54.000000
3 6.000000
You can also use groupby on ['cluster', 'org'] and then use mean():
In [57]: df.groupby(['cluster', 'org']).mean()
Out[57]:
time
cluster org
1 a 7
c 23
2 d 74
h 34
3 w 6
I would simply do this, which literally follows your desired logic:
df.groupby(['org']).mean().groupby(['cluster']).mean()
Another possible solution is to reshape the dataframe using pivot_table() and then take mean(). Note that aggfunc='mean' (the default) first averages time by cluster and org.
df.pivot_table(index='org', columns='cluster', values='time', aggfunc='mean').mean()
Another possibility is to use the level parameter of mean() after the first groupby() to aggregate:
df.groupby(['cluster', 'org']).mean().mean(level='cluster')

Mark sudden changes in prices in a dataframe time series and color them

I have a pandas dataframe of prices for different months and years (a time series), 80 columns. I want to be able to detect significant changes in prices either up or down and color them differently in a dataframe. Is that possible, and what would be the best approach?
Jan-2001 Feb-2001 Jan-2002 Feb-2002 ....
100 30 10 ...
110 25 1 ...
40 5 50
70 11 4
120 35 2
Here in the first column, 40 and 70 should be marked; in the second column, 5 and 11 should be marked; in the third column I'm not really sure, but probably 1, 50, 4, 2...
Your question involves two problems, as I see it.
1. Printing the highlighting depends on the output method you're trying to target, be it STDOUT, a file, or something program-specific.
2. Identifying outliers based on the column data. It's hard to tell whether you want this based on the entire dataset or on the previous data in the column, like a rolling outlier, i.e. the preceding data is used to decide whether the next value is out of whack.
In the instance below I provide a method that approaches the data with standard deviation / z-scoring based on the mean of the entire column. You will have to tweak the > / < thresholds to get to your desired state; there are many intricacies to this concept, and I would suggest taking a look at a few resources on the subject.
For your data:
Jan-2001,Feb-2001,Jan-2002
100,30,10
110,25,1
40,5,50
70,11,4
120,35,20000
I am aware of methods to highlight, but not in the terminal. The styling API described at https://pandas.pydata.org/pandas-docs/stable/style.html works in a few environments (e.g. Jupyter notebooks or HTML output).
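For example (a sketch, not part of the original answer; it assumes df holds only the numeric price columns, loaded as in the sample code below), a Styler can color values whose column z-score is large when rendering to HTML or in a notebook:
def highlight_outliers(col, threshold=1.5):
    # flag cells whose z-score within their column exceeds the threshold
    z = (col - col.mean()) / col.std(ddof=0)
    return ['color: red' if abs(v) > threshold else '' for v in z]

styled = df.style.apply(highlight_outliers, axis=0)
# display `styled` in a notebook, or export it to HTML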
To get at the original item, identifying outliers in your data, you could use something like the code below, which works from standard deviation and z-score.
Sample Code:
df = pd.read_csv("full.txt")
original = df.columns
print(df)
for col in df.columns:
col_zscore = col + "_zscore"
df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0)
print(df[col].loc[(df[col_zscore] > 1.5) | (df[col_zscore] < -.5)])
print(df)
Output 1: # prints the original dataframe
Jan-2001 Feb-2001 Jan-2002
100 30 10
110 25 1
40 5 50
70 11 4
120 35 20000
Output 2: # Identifies the outliers
2 40
3 70
Name: Jan-2001, dtype: int64
2 5
3 11
Name: Feb-2001, dtype: int64
0 10
1 1
3 4
4 20000
Name: Jan-2002, dtype: int64
Output 3: # Prints the full dataframe created, with zscore of each item based on the column
Jan-2001 Feb-2001 Jan-2002 Jan-2001_std Jan-2001_zscore \
0 100 30 10 32.710854 0.410152
1 110 25 1 32.710854 0.751945
2 40 5 50 32.710854 -1.640606
3 70 11 4 32.710854 -0.615227
4 120 35 2 32.710854 1.093737
Feb-2001_std Feb-2001_zscore Jan-2002_std Jan-2002_zscore
0 12.735776 0.772524 20.755722 -0.183145
1 12.735776 0.333590 20.755722 -0.667942
2 12.735776 -1.422147 20.755722 1.971507
3 12.735776 -0.895426 20.755722 -0.506343
4 12.735776 1.211459 20.755722 -0.614076
Resources for zscore are here:
https://statistics.laerd.com/statistical-guides/standard-score-2.php
