I have the data below:
df1
Hema shiva Ishan
0 22 30 33
1 34 32 21
2 20 12 14
3 26 14 18
4 12 28 17
5 30 11 22
6 18 15 18
7 19 18 19
8 22 20 32
I want to take the ratio of the first row's value with the rest of each column, e.g. the first column should be divided by 22, the second column by 30, and the third column by 33.
The answer is below.
Please let me know if I'm missing something.
Just divide the first row by the DF:
df.iloc[0] / df
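For context, here is a minimal sketch of what that expression produces on the first few rows of the data above (the question names the frame df1 while the answer uses df); if what you actually want is each value divided by the first row, the inverse expression df / df.iloc[0] gives that instead.

import pandas as pd

# first three rows of the sample data from the question
df = pd.DataFrame({'Hema': [22, 34, 20],
                   'shiva': [30, 32, 12],
                   'Ishan': [33, 21, 14]})

# the answer above: first row divided by every value, column-wise
print(df.iloc[0] / df)

# the reverse ratio, if each value should be divided by the first row instead
print(df / df.iloc[0])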
I am trying to perform a multi-class text classification using SVM with a small dataset by adapting from this guide. The input csv contains a 'text' column and a 'label' column (which have been manually assigned for this specific task).
One label needs to be assigned for each text entry. Using the LinearSVC model and TfidfVectorizer, I obtained an accuracy score of 75%, which seems higher than expected for a very small dataset of only 400 samples. To raise the accuracy further, I wanted to look at the entries that were not correctly classified, but here I run into an issue. Since I used train_test_split like this:
Train_X, Test_X, Train_Y, Test_Y = train_test_split(X, y, test_size=0.1, random_state = 1004)
I don't know which text entries were used by the train_test_split function for the test set (as far as I understand, the function randomly selects 10% of the entries for the test set). So I don't know which subset of the original corpus labels I should compare the predicted test labels against. In other words, is there a way to enforce which subset is used as the test set, e.g. the last 40 of the 400 entries in the dataset?
This would help to manually compare the predicted labels vs the ground truth labels.
Below is the code:
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
import os
class Config:
    # Data and output directory config
    data_path = r'./take3/Data'
    code_train = r'q27.csv'

if __name__ == "__main__":
    print('--------Code classification--------\n')
    Corpus = pd.read_csv(os.path.join(Config.data_path, Config.code_train), sep=',', encoding='cp1252', usecols=['text', 'label'])
    train_text = ['' if type(t) == float else t for t in Corpus['text'].values]
    # todo fine tuning
    tfidf = TfidfVectorizer(
        sublinear_tf=True,
        min_df=3, norm='l2',
        encoding='latin-1',
        ngram_range=(1, 2),
        stop_words='english')
    X = tfidf.fit_transform(train_text)  # Learn vocabulary and idf, return document-term matrix.
    # print('Array mapping from feature integer indices to feature name', tfidf.get_feature_names())
    print('X.shape:', X.shape)
    y = np.array(list(Corpus['label']))
    print('The corpus original labels:', y)
    print('y.shape:', y.shape)
    Train_X, Test_X, Train_Y, Test_Y = train_test_split(X, y, test_size=0.1, random_state=1004)
    model = LinearSVC(random_state=1004)
    model.fit(Train_X, Train_Y)
    SVM_predict_test = model.predict(Test_X)
    accuracy = accuracy_score(Test_Y, SVM_predict_test, normalize=True, sample_weight=None) * 100
    print('Predicted labels for the test dataset', SVM_predict_test)
    print("SVM accuracy score: {:.4f}".format(accuracy))
And this is the received output:
--------Code classification--------
X.shape: (400, 136)
The corpus original labels: [15 20 9 14 98 12 3 4 4 22 99 3 98 20 99 1 10 20 8 15 98 12 18 7
20 99 8 8 13 2 8 6 22 4 98 5 98 12 18 8 98 18 24 4 3 19 12 5
20 6 8 15 5 14 19 22 16 10 24 16 98 8 8 16 2 20 4 8 20 6 22 98
3 98 15 12 2 13 5 8 8 1 10 16 20 12 7 20 98 22 99 10 12 8 8 16
16 4 4 99 20 8 16 2 12 15 16 10 5 22 8 7 7 4 5 12 16 14 1 10
22 20 4 4 5 99 16 3 5 22 99 5 3 4 4 3 6 99 8 20 2 10 98 6
6 8 99 3 8 99 2 5 15 6 6 7 8 14 9 4 20 3 99 5 98 15 5 5
20 10 4 99 99 16 22 8 10 22 98 12 3 5 9 99 14 8 9 18 20 14 15 20
20 1 6 23 22 20 6 1 18 8 12 10 15 10 6 10 3 4 8 24 14 22 5 3
22 24 98 98 98 4 15 19 5 8 1 17 16 6 22 19 4 8 2 15 12 99 16 8
9 1 8 22 14 5 20 2 10 10 22 12 98 3 19 5 98 14 19 22 18 16 98 16
6 4 24 98 24 98 15 1 3 99 5 10 22 4 16 98 22 1 8 4 20 8 8 5
20 4 3 20 22 4 20 12 7 21 5 4 16 8 22 20 99 5 6 99 8 3 4 99
6 8 12 3 10 4 8 5 14 20 6 99 4 4 6 4 98 21 1 23 20 98 19 6
4 22 98 98 20 10 8 10 19 16 14 98 14 12 10 4 22 14 3 98 10 20 98 10
9 7 3 8 3 6 6 98 8 99 1 20 18 8 2 6 99 99 99 14 14 16 20 99
1 98 23 6 12 4 1 3 99 99 3 22 5 7 16 99]
y.shape: (400,)
Predicted labels for the test dataset [ 1 8 5 4 15 10 14 12 6 8 8 16 98 20 7 99 99 12 99 24 4 98 99 3
20 3 6 14 18 98 99 22 4 99 4 10 14 4 3 98]
SVM accuracy score: 75.0000
The default behavior of train_test_split is to split the data into random train and test subsets. You can enforce a static split by setting shuffle=False and removing random_state; the test set is then simply the last 10% of the rows, i.e. the last 40 entries here.
Train_X, Test_X, Train_Y, Test_Y = train_test_split(X, y, test_size=0.1, shuffle=False)
See How to get a non-shuffled train_test_split in sklearn
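Alternatively, if you want to keep the shuffled split but still know which rows ended up in the test set, you can pass an array of row positions through train_test_split alongside X and y, since the function splits any number of arrays the same way. A minimal sketch, assuming the Corpus, X and y defined in the question's code:

import numpy as np
from sklearn.model_selection import train_test_split

indices = np.arange(X.shape[0])  # one position per row of the corpus

# split the row positions together with the features and labels
Train_X, Test_X, Train_Y, Test_Y, idx_train, idx_test = train_test_split(
    X, y, indices, test_size=0.1, random_state=1004)

# the original text and labels of the rows that landed in the test set
test_rows = Corpus.iloc[idx_test]
print(idx_test)                           # positions of the test samples in the corpus
print(test_rows[['text', 'label']].head())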
I have a dataframe, and I'm trying to add a new column that contains, for each row, a list of the column names ordered by that row's values.
Searching has proved to be difficult, as the search terms have so much in common with doing a column sort overall. Instead, I'm trying to customize the list for each row.
import pandas as pd

df = pd.DataFrame([
    ["a", 88, 3, 78, 8, 40],
    ["b", 100, 20, 29, 13, 91],
    ["c", 77, 92, 42, 72, 58],
    ["d", 39, 53, 69, 7, 40],
    ["e", 26, 62, 77, 33, 86],
    ["f", 94, 5, 28, 96, 7]
], columns=['id', 'x1', 'x2', 'x3', 'x4', 'x5'])
have = df.set_index('id')
+----+-----+----+----+----+----+----------------------------+
| id | x1 | x2 | x3 | x4 | x5 | ordered_cols |
+----+-----+----+----+----+----+----------------------------+
| a | 88 | 3 | 78 | 8 | 40 | ['x2','x4','x5','x3','x1'] |
| b | 100 | 20 | 29 | 13 | 91 | ['x4','x2','x3','x5','x1'] |
| c | 77 | 92 | 42 | 72 | 58 | … |
| d | 39 | 53 | 69 | 7 | 40 | … |
| e | 26 | 62 | 77 | 33 | 86 | … |
| f | 94 | 5 | 28 | 96 | 7 | … |
+----+-----+----+----+----+----+----------------------------+
Try stack with sort_values and groupby.
Assuming your dataframe is called df (with id set as the index, like have above):
df["sorted_cols"] = (
df.stack().sort_values().reset_index(1).groupby(level=0)["level_1"].agg(list)
)
print(df)
x1 x2 x3 x4 x5 sorted_cols
id
a 88 3 78 8 40 [x2, x4, x5, x3, x1]
b 100 20 29 13 91 [x4, x2, x3, x5, x1]
c 77 92 42 72 58 [x3, x5, x4, x1, x2]
d 39 53 69 7 40 [x4, x1, x5, x2, x3]
e 26 62 77 33 86 [x1, x4, x2, x3, x5]
f 94 5 28 96 7 [x2, x5, x3, x1, x4]
The solution by Manakin will be the fastest option, because it is vectorized.
Use pandas.DataFrame.apply with axis=1, and a list comprehension to sort the column names by the row values.
The list comprehension is from SO: Sorting list based on values from another list, and does not require importing any additional packages.
import pandas as pd
# add the new column; apply to the indexed frame so the 'id' strings are not mixed into the sort
have['ordered_cols'] = have.apply(lambda y: [x for _, x in sorted(zip(y, have.columns))], axis=1)
# display(have)
x1 x2 x3 x4 x5 ordered_cols
id
a 88 3 78 8 40 [x2, x4, x5, x3, x1]
b 100 20 29 13 91 [x4, x2, x3, x5, x1]
c 77 92 42 72 58 [x3, x5, x4, x1, x2]
d 39 53 69 7 40 [x4, x1, x5, x2, x3]
e 26 62 77 33 86 [x1, x4, x2, x3, x5]
f 94 5 28 96 7 [x2, x5, x3, x1, x4]
Here is a simple one-line solution using apply and np.argsort:
import numpy as np
have["ordered_cols"] = have.apply(lambda row: have.columns[np.argsort(row.values)].values, axis=1)
have
Hey,
you can try looping over the rows and sorting the values in each row. The code below will do the trick:
ordered_cols = []
for index, row in have.iterrows():
    ordered_cols.append(list(have.sort_values(by=index, ascending=True, axis=1).columns))
have['ordered_cols'] = ordered_cols
have
Output:
x1 x2 x3 x4 x5 ordered_cols
id
a 88 3 78 8 40 [x2, x4, x5, x3, x1]
b 100 20 29 13 91 [x4, x2, x3, x5, x1]
c 77 92 42 72 58 [x3, x5, x4, x1, x2]
d 39 53 69 7 40 [x4, x1, x5, x2, x3]
e 26 62 77 33 86 [x1, x4, x2, x3, x5]
f 94 5 28 96 7 [x2, x5, x3, x1, x4]
I hope this was helpful.
Cheers!
I have been using the clustergram feature in Matlab on my data in the following way;
Cobj2 = clustergram(c,'RowLabels',[1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40],'ColumnLabels',{'Value1','Value2','Value3','Value4','Value5','Value6'},'Colormap',redbluecmap,'Standardize',1)
Matlab sorts my data into clusters, and although I have found the dendrogram and heatmap informative, I would like to find out a little more about my clusters, for example the Euclidean distances between them or some other measure, so I can determine their 'strength'. Is there a way to get more information or statistics about my graph?
I'm a power Excel pivot-table user who is forcing himself to learn R. I know exactly how to do this analysis in Excel, but can't figure out the right way to code it in R.
I'm trying to group user data by 2 different variables, while grouping the variables into ranges (or bins), then summarizing other variables.
Here is what the data looks like:
userid visits posts revenue
1 25 0 25
2 2 2 0
3 86 7 8
4 128 24 94
5 30 5 18
… … … …
280000 80 10 100
280001 42 4 25
280002 31 8 17
Here is what I am trying to get the output to look like:
VisitRange PostRange # of Users Total Revenue Average Revenue
0 0 X Y Z
1-10 0 X Y Z
11-20 0 X Y Z
21-30 0 X Y Z
31-40 0 X Y Z
41-50 0 X Y Z
> 50 0 X Y Z
0 1-10 X Y Z
1-10 1-10 X Y Z
11-20 1-10 X Y Z
21-30 1-10 X Y Z
31-40 1-10 X Y Z
41-50 1-10 X Y Z
> 50 1-10 X Y Z
I want to group visits and posts into bins of 10 up to 50, and group anything higher than 50 as '> 50'.
I've looked at tapply and ddply as ways to accomplish this, but I don't think they will work the way I'm expecting, though I could be wrong.
Lastly, I know I could do this in SQL using an if/then statement to identify the range of visits and posts (for example, if visits are between 1 and 10, then '1-10'), then just group by visit range and post range, but my goal here is to start forcing myself to use R. Maybe R isn't the right tool here, but I think it is…
All help would be appreciated. Thanks in advance.
The idiom of the plyr package, and ddply in particular, is very similar to pivot tables in Excel.
In your example, the only thing you need to do is cut your grouping variables into the desired breaks before passing them to ddply. Here is an example:
First, create some sample data:
set.seed(1)
dat <- data.frame(
    userid = 1:500,
    visits = sample(0:50, 500, replace=TRUE),
    posts = sample(0:50, 500, replace=TRUE),
    revenue = sample(1:100, replace=TRUE)
)
Now, use cut to divide your grouping variables into the desired ranges:
dat$PostRange <- cut(dat$posts, breaks=seq(0, 50, 10), include.lowest=TRUE)
dat$VisitRange <- cut(dat$visits, breaks=seq(0, 50, 10), include.lowest=TRUE)
Finally, use ddply with summarise:
library(plyr)
ddply(dat, .(VisitRange, PostRange),
      summarise,
      Users = length(userid),
      `Total Revenue` = sum(revenue),
      `Average Revenue` = mean(revenue))
The results:
VisitRange PostRange Users Total Revenue Average Revenue
1 [0,10] [0,10] 23 1318 57.30435
2 [0,10] (10,20] 23 1136 49.39130
3 [0,10] (20,30] 28 1499 53.53571
4 [0,10] (30,40] 20 923 46.15000
5 [0,10] (40,50] 14 826 59.00000
6 (10,20] [0,10] 23 1227 53.34783
7 (10,20] (10,20] 17 642 37.76471
8 (10,20] (20,30] 20 888 44.40000
9 (10,20] (30,40] 15 622 41.46667
10 (10,20] (40,50] 21 968 46.09524
11 (20,30] [0,10] 23 1226 53.30435
12 (20,30] (10,20] 19 1021 53.73684
13 (20,30] (20,30] 23 1380 60.00000
14 (20,30] (30,40] 8 313 39.12500
15 (20,30] (40,50] 19 710 37.36842
16 (30,40] [0,10] 18 782 43.44444
17 (30,40] (10,20] 25 1308 52.32000
18 (30,40] (20,30] 14 553 39.50000
19 (30,40] (30,40] 26 1131 43.50000
20 (30,40] (40,50] 20 1295 64.75000
21 (40,50] [0,10] 20 958 47.90000
22 (40,50] (10,20] 21 1168 55.61905
23 (40,50] (20,30] 20 1118 55.90000
24 (40,50] (30,40] 20 1009 50.45000
25 (40,50] (40,50] 20 934 46.70000