I have two data sets like the ones below.
1st dataset
x1 x2 types
1 3 1
2 4 1
3 5 1
2nd dataset
x1 x2 types
4 8 -1
2 10 -1
3 12 -1
In the 1st dataset x2 = 2 + x1, and in the 2nd x2 = 2 * x1.
How can I train an SVM on this data in R, so that if I input another point like (2, 4) it is assigned to the 2nd class?
You have to tell svm() which column holds the class labels for your data, which dataset to use, and which parameters you want to use.
For example, supposing all of your data is together in one dataframe called "dataset" (with the types column stored as a factor, so that svm() performs classification rather than regression), you could call:
library(e1071)  # svm() and tune.svm() come from the e1071 package
svm(types ~ ., data = dataset, kernel = "radial", gamma = 0.1, cost = 1)
To check which parameters work better for your problem, you can use tune.svm():
tune.svm(types ~ ., data = dataset, gamma = 10^(-6:-1), cost = 10^(-3:1))
Once the model is trained, predict() will classify a new point such as (2, 4).
I have a situation where I need to print the number of distinct values for every categorical column in my data frame.
The dataframe looks like this:
Gender Function Segment
M IT LE
F IT LM
M HR LE
F HR LM
The output should give me the following:
Variable_Name Distinct_Count
Gender 2
Function 2
Segment 2
How can I achieve this?
Use nunique(), then pass the resulting Series into a new dataframe and set the column names:
df_unique = df.nunique().to_frame().reset_index()
df_unique.columns = ['Variable','DistinctCount']
print(df_unique)
Variable DistinctCount
0 Gender 2
1 Function 2
2 Segment 2
This is not pretty, but it reliably produces the expected output:
new_data = {'Variable_Name': [], 'Distinct_Count': []}
for i in list(df):  # iterate over the column names
    new_data['Variable_Name'].append(i)
    new_data['Distinct_Count'].append(df[i].nunique())
new_df = pd.DataFrame(new_data)
print(new_df)
Output:
Variable_Name Distinct_Count
0 Gender 2
1 Function 2
2 Segment 2
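For reference, the same table can also be built in one chained expression; this is a sketch of the same nunique() idea, assuming df is the frame from the question:
df_unique = df.nunique().rename_axis('Variable_Name').reset_index(name='Distinct_Count')
print(df_unique)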
I've got a Dataframe like this:
df = pd.DataFrame(np.reshape(np.arange(0,9), (3,3)))
print(df)
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
I'd like to normalize two of the columns against a reference column. For example, if I chose df[0] as my reference column, then df[1] and df[2] would also have a mean of 3 and a standard deviation of 3.
What's the best way to do this?
You can standardize each column (subtract its own mean, divide by its own standard deviation) and then rescale by the standard deviation and mean of the reference column ref:
ref = 0
means = df.mean()
stds = df.std()
# standardize each column, then rescale to the reference column's std and mean
(df - means) / stds * stds[ref] + means[ref]
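A quick sanity check on the example frame (a minimal sketch, reusing the df defined in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.reshape(np.arange(0, 9), (3, 3)))
normed = (df - df.mean()) / df.std() * df.std()[0] + df.mean()[0]
print(normed.mean())  # every column now has mean 3.0
print(normed.std())   # and standard deviation 3.0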
I am working on a simple time-series linear regression using statsmodels.api.OLS, running the regressions on groups of data defined by an identifier variable. I have the grouped regressions working, but I am now trying to merge the results of the regressions back into the original dataframe and am getting index errors.
A simplified version of my original dataframe, which we'll call "df" looks like this:
id value time
a 1 1
a 1.5 2
a 2 3
a 2.5 4
b 1 1
b 1.5 2
b 2 3
b 2.5 4
My function to conduct the regressions is as follows:
import pandas as pd
import statsmodels.api as sm

def ols_reg(df, xcol, ycol):
    x = df[xcol]
    y = df[ycol]
    x = sm.add_constant(x)
    model = sm.OLS(y, x, missing='drop').fit()
    predictions = model.predict()
    return pd.Series(predictions)
I then define a variable that stores the results of conducting this function on my dataset, grouping by the id column. This code is as follows:
var = df.groupby('id').apply(ols_reg, xcol='time', ycol='value')
This returns a Series of the predicted linear values that has the same length as the original dataset, and looks like the following:
id
a 0 0.5
1 1
2 2.5
3 3
b 0 0.5
1 1
2 2.5
3 3
The column starting with 0.5 (ignore the values; this is not the actual output) holds the predicted values from the regression. As the function's return statement shows, this is a pandas Series.
I now want to merge these results back into the original dataframe, to look like the following:
id value time results
a 1 1 0.5
a 1.5 2 1
a 2 3 2.5
a 2.5 4 3
b 1 1 0.5
b 1.5 2 1
b 2 3 2.5
b 2.5 4 3
I've tried a number of methods, such as setting a new column in the original dataset equal to the series, but get the following error:
TypeError: incompatible index of inserted column with frame index
Any help on getting these results back into the original dataframe would be greatly appreciated. There are a number of other posts that correspond to this topic, but none of the solutions worked for me in this instance.
UPDATE:
I've solved this with a relatively simple method: I converted the series to a list and set a new column in the dataframe equal to that list. However, I would be really curious to hear if others have better/different/unique solutions to this problem. Thanks!
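For what it is worth, one way to avoid the index mismatch entirely is to index each group's predictions by the group's own row index and let pandas align on assignment; a sketch using a variant of the function above, assuming the df and imports from earlier:
def ols_reg(g, xcol, ycol):
    # fit on one group and return predictions indexed like that group
    x = sm.add_constant(g[xcol])
    model = sm.OLS(g[ycol], x, missing='drop').fit()
    return pd.Series(model.predict(), index=g.index)

df['results'] = df.groupby('id', group_keys=False).apply(ols_reg, xcol='time', ycol='value')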
To avoid losing row positions when inserting the predictions into the missing values, you can use the following approach. In this example:
X_train: the train data, a pandas dataframe whose known real results are in y_train.
X_test: the test data, a pandas dataframe with no corresponding known real results; these need to be predicted.
y_train: a pandas Series with the real known results for the train data.
prediction: a pandas Series with the predicted results for X_test.
To get the complete data merged into one pandas dataframe, first put the known part together:
# merge the train part of the data into one dataframe
X_train = X_train.sort_index()
y_train = y_train.sort_index()
result = pd.concat([X_train, X_test])
# if you need to convert a numpy array to a pandas Series first:
# prediction = pd.Series(prediction)
# here is the magic: fill the unknown rows with the predictions (.loc avoids chained assignment)
result.loc[result['specie'].isnull(), 'specie'] = prediction.values
Provided 'specie' has no other missing values, this will do the job.
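A minimal, self-contained sketch of that fill step (the frame and predictions here are made up for illustration):
import numpy as np
import pandas as pd

result = pd.DataFrame({'x': [1, 2, 3, 4],
                       'specie': ['setosa', np.nan, 'virginica', np.nan]})
prediction = pd.Series(['versicolor', 'setosa'])  # hypothetical predictions, in row order

# fill only the rows where 'specie' is unknown
result.loc[result['specie'].isnull(), 'specie'] = prediction.values
print(result)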
I have 3 questions:
1)
The confusion matrix for sklearn is as follows:
TN | FP
FN | TP
While when I'm looking at online resources, I find it like this:
TP | FP
FN | TN
Which one should I consider?
2)
Since scikit-learn's confusion matrix differs from the one I find in other resources, what will the structure be in the multiclass case? I'm looking at this post:
Scikit-learn: How to obtain True Positive, True Negative, False Positive and False Negative
In that post, #lucidv01d posted a graph explaining the per-class categories for the multiclass case. Are those categories laid out the same way in scikit-learn?
3)
How do you calculate the accuracy of a multiclass classifier? For example, I have this confusion matrix:
[[27 6 0 16]
[ 5 18 0 21]
[ 1 3 6 9]
[ 0 0 0 48]]
In that same post I referred to in question 2, he has written this equation:
Overall accuracy
ACC = (TP+TN)/(TP+FP+FN+TN)
but isn't that just for binary classification? I mean, which class's values do I substitute for TP?
The reason why sklearn shows its confusion matrix as
TN | FP
FN | TP
is that in its code, 0 is considered the negative class and 1 the positive class. sklearn always treats the smaller class value as negative and the larger as positive; by value, I mean the class label (e.g. 0 or 1). The exact order depends on your dataset's classes.
The accuracy will be the sum of the diagonal elements divided by the sum of all the elements. The diagonal elements are the numbers of correct predictions.
As the sklearn guide says: "(Wikipedia and other references may use a different convention for axes)"
What does it mean? When building the confusion matrix, the first step is to decide where to put predictions and real values (true labels). There are two possibilities:
put predictions in the columns, and true labels in the rows
put predictions in the rows, and true labels in the columns
It is totally subjective which way you want to go. From this picture, explained here, it is clear that scikit-learn's convention is to put predictions in the columns, and true labels in the rows.
Thus, according to scikit-learn's convention:
the first column contains negative predictions (TN and FN)
the second column contains positive predictions (TP and FP)
the first row contains negative labels (TN and FP)
the second row contains positive labels (TP and FN)
the diagonal contains the number of correctly predicted labels.
Based on this information I think you will be able to solve part 1 and part 2 of your questions.
For part 3, you just sum the values in the diagonal and divide by the sum of all elements, which will be
(27 + 18 + 6 + 48) / (27 + 18 + 6 + 48 + 6 + 16 + 5 + 21 + 1 + 3 + 9)
or you can just use the score() function.
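To see the convention concretely, here is a minimal sketch with made-up binary labels:
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1]
y_pred = [0, 1, 1, 1]
# rows are true labels, columns are predictions:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
# [[1 1]
#  [0 2]]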
The scikit-learn convention is to place predictions in columns and real values (true labels) in rows.
By default the negative class (0) comes first (top row / left column) and the positive class (1) second (bottom row / right column); the order can be changed using labels = [1, 0].
You can calculate the overall accuracy this way:
import numpy as np

M = np.array([[27, 6, 0, 16],
              [5, 18, 0, 21],
              [1, 3, 6, 9],
              [0, 0, 0, 48]])

# sum of the diagonal (the correct predictions)
w = M.diagonal()
w.sum()   # 99

# sum of all the matrix elements
M.sum()   # 160

ACC = w.sum() / M.sum()
ACC       # 0.61875
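To illustrate the labels argument mentioned above, a small sketch with made-up binary labels:
from sklearn.metrics import confusion_matrix

y_true = [0, 1, 0, 1]
y_pred = [0, 1, 1, 1]
print(confusion_matrix(y_true, y_pred))                 # [[TN, FP], [FN, TP]] -> [[1 1], [0 2]]
print(confusion_matrix(y_true, y_pred, labels=[1, 0]))  # [[TP, FN], [FP, TN]] -> [[2 0], [1 1]]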
I'm faced with a classification problem that seems to lend itself to a Naive Bayes classifier (NBC). However, I have one problem: usually an NBC works by estimating the most likely class c out of a set of classes C, based on an observation x of a random variable X.
In my case I have multiple variables X1, X2, which may or may not share features. Variable X1 could have features (xa, xb, xc), X2 might have (xc, xd), and another variable X3 might have (xe). Is it possible to construct one classifier that allows me to classify X1, X2 and X3 simultaneously, even though their feature sets intersect only partially or not at all?
The problem could be viewed from another point of view: for certain classes I am missing all data in certain features. Consider the following tables:
Classes = {C1,C2}.
Features = X = {X1,X2,X3}, X1={A,B}, X2={1,2}, X3={Y,N}
Class C1:
X1 X2 X3 #observations
A 1 ? 50
A 2 ? 20
B 1 ? 20
B 2 ? 10
Class C2:
X1 X2 X3 #observations
A 1 Y 20
A 1 N 0
A 2 Y 20
A 2 N 10
B 1 Y 10
B 1 N 20
B 2 Y 10
B 2 N 10
As you can see, the feature X3 does not have any bearing on class C1: no data is available for feature X3 when classifying C1. Can I make a classifier that scores X = (A, 2, N) against both C1 and C2? How would I calculate the conditional probabilities for the missing X3 data in class C1?