I have several binary outcome variables (see below) that I'd like to encode into a single column y for use in a multilabel Keras model in Python.
has_carinsurance has_lifeinsurance has_petinsurance
1 0 1
So far, I've created the column y using
df['y'] = df[['has_carinsurance','has_lifeinsurance','has_petinsurance']].values.tolist()
How would I convert this to an array suitable for multilabel classification in python while making the original labels retrievable?
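One possible approach, sketched under assumptions: skip the list-valued column and take the label columns directly as a 2D NumPy array, keeping the column names in a list so that any row can be mapped back to label names. The second data row, the `decode` helper, and the 0.5 threshold are hypothetical and only there for illustration.

```python
import numpy as np
import pandas as pd

label_cols = ["has_carinsurance", "has_lifeinsurance", "has_petinsurance"]

# Toy frame: the first row mirrors the example above, the second row is made up.
df = pd.DataFrame([[1, 0, 1], [0, 1, 0]], columns=label_cols)

# 2D target array of shape (n_samples, n_labels).
y = df[label_cols].to_numpy(dtype="float32")

def decode(row, threshold=0.5):
    """Hypothetical helper: map one row of 0/1 values (or predicted
    probabilities) back to the names of the active labels."""
    return [name for name, value in zip(label_cols, row) if value >= threshold]

first_labels = decode(y[0])
```

Because the position of each column in `label_cols` matches the position in `y`, the same helper could be applied to a row of model predictions, assuming the model outputs one probability per label.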
I tried to apply the pandas get_dummies function to my dataset.
The problem is that the number of category values does not match between the training set and the validation set.
For example, a column in the training set has 5 distinct values, e.g. [1, 2, 3, 4, 5].
However, the same column in the validation set has only 3 distinct values, e.g. [1, 3, 5].
When I built the model on the training set, 5 dummies were created:
ex: dum_1, dum_2, dum_3, dum_4, dum_5
So if I just use the same function on the validation set, only 3 dummies will be created:
ex: dum_1, dum_3, dum_5
That makes it impossible to use my model to predict on the validation set.
How can I create the same dummies for the training and validation sets?
(It is not possible to concatenate the two datasets. Please suggest a method other than pd.concat.)
Also, if I add new columns to the validation set, I expect the result will be different,
because the order of the dummies will not match between the training and validation sets.
Thanks.
All you need to do is:
Create the columns in the validation dataset that are present in the training data but missing from the validation data.
missing_cols = [col for col in train.columns if col not in valid.columns]
for col in missing_cols:
    valid[col] = 0
These columns are appended at the end, so the column order changes. In the next step, rearrange the columns to match the training data:
valid = valid[train.columns]
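As an alternative sketch, both steps (adding the missing columns and restoring the order) can be done with a single `DataFrame.reindex` call. The toy series and the `dum` prefix mirror the question's example and are assumptions about the real data.

```python
import pandas as pd

# Toy data mirroring the question: 5 categories in train, 3 in valid.
train = pd.get_dummies(pd.Series([1, 2, 3, 4, 5]), prefix="dum", dtype=int)
valid = pd.get_dummies(pd.Series([1, 3, 5]), prefix="dum", dtype=int)

# Reindex the validation columns against the training columns:
# missing dummies are created and filled with 0, and the order follows train.
valid_aligned = valid.reindex(columns=train.columns, fill_value=0)
```

Note that `reindex` also drops any column that exists only in the validation set, which is usually what you want when the model was fitted on the training columns.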
In my training set I have 24 feature vectors (FVs). Each FV contains 2 lists. When I try to fit a model such as model = LogisticRegression() or model = KNeighborsClassifier(n_neighbors=k) on this data, I get the error ValueError: setting an array element with a sequence.
In my dataframe, each row represents one FV. There are 3 columns. The first column contains a list of an individual's heart rate values, the second a list of the corresponding activity data, and the third the target. Visually, it looks something like this:
HR ACT Target
[0.5018, 0.5106, 0.4872] [0.1390, 0.1709, 0.0886] 1
[0.4931, 0.5171, 0.5514] [0.2423, 0.2795, 0.2232] 0
Should I:
1. Join both lists to form one long FV
2. Expand both lists such that each column represents one value. In other words, if there are 5 items in HR and ACT data for a FV, the new dataframe would have 10 columns for features and 1 for Target.
How do logistic regression and KNN handle input data? I understand that logistic regression combines the inputs linearly using weights or coefficient values. But I am not sure what that means when it comes to lists vs. dataframe columns. Does it automatically convert the corresponding values of the dataframe columns to a list before transforming? Is there a difference between methods 1 and 2?
Additionally, if a long list is required, should I order it as [HR,HR,HR,ACT,ACT,ACT] or [HR,ACT,HR,ACT,HR,ACT]?
You should go with option 2:
Expand both lists such that each column represents one value. In other words, if there are 5 items in HR and ACT data for a FV, the new dataframe would have 10 columns for features and 1 for Target.
You should then select the feature columns from the dataframe and pass them as X, and the target column as y, to the model's fit function.
Sklearn's models accept inputs of shape [n_samples, n_features]. After following the second option you proposed, your training data will be a 2D array of shape [n_samples, 10], which is exactly what they expect.
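A minimal sketch of the expansion, assuming every list in a column has the same length. The two rows are taken from the question's table (3 values per list, so 6 feature columns here rather than 10).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "HR": [[0.5018, 0.5106, 0.4872], [0.4931, 0.5171, 0.5514]],
    "ACT": [[0.1390, 0.1709, 0.0886], [0.2423, 0.2795, 0.2232]],
    "Target": [1, 0],
})

# Turn each list-valued column into a 2D array: one column per list element.
hr = np.array(df["HR"].tolist())
act = np.array(df["ACT"].tolist())

# Stack side by side, giving the [HR,HR,HR,ACT,ACT,ACT] layout.
X = np.hstack([hr, act])      # shape (n_samples, 6)
y = df["Target"].to_numpy()   # shape (n_samples,)
```

`X` and `y` can then be passed to `model.fit(X, y)`. The column order does not matter to either model, as long as the same order is used for training and prediction.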
My features are normalized RGB values, i.e. values in the range 0 to 1. I have declared the training matrix as CV_64FC1. It has 1000 rows and 60 columns, with decimal values such as 0.3333 or 0.2789. I read in the OpenCV docs that the training matrix has to be of float type, but my matrix is of type double. How can I give this matrix to the SVM for training without converting it to float?
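Try casting your variables to float type, i.e. convert the matrix to CV_32FC1 (for example with Mat::convertTo) before passing it to the SVM.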
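For illustration only, here is how the same cast looks with NumPy arrays, which is how OpenCV's Python bindings represent matrices (float64 corresponds to CV_64F and float32 to CV_32F). The random matrix is a placeholder for the real 1000 x 60 training data.

```python
import numpy as np

# Placeholder training matrix: 1000 rows, 60 columns, float64 values in [0, 1).
rng = np.random.default_rng(0)
train_64 = rng.random((1000, 60))

# Cast to single precision.
train_32 = train_64.astype(np.float32)

# Largest rounding error introduced by the cast.
max_abs_diff = float(np.max(np.abs(train_64 - train_32)))
```

For values in the range 0 to 1 the rounding error of the cast is tiny compared with values such as 0.3333, so converting should not meaningfully change the features.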
I use data in Excel to produce a chart.
Then I run a regression and get an equation. I'd like to know what value the regression gives for a particular x (for example, x = 7.6 is the value for which I want an estimate of y).
It is an approximation with a 6th-degree polynomial.
One simple method would be this: I have the equation, so I could use it directly.
However, I wondered if there is a faster method. Can I enter 7.6 somewhere and get the result quickly?
If you are looking at a linear regression line (a straight line), you could try the FORECAST formula:
=FORECAST(X, Known Ys, Known Xs)
You could also build your own equation automatically from:
=LINEST(...)
I found the following on a site describing the capabilities of the linest function in excel:
In addition to using LOGEST to calculate statistics for other
regression types, you can use LINEST to calculate a range of other
regression types by entering functions of the x and y variables as the
x and y series for LINEST. For example, the following formula:
=LINEST(yvalues, xvalues^COLUMN($A:$C))
works when you have a single column of y-values and a single column of
x-values to calculate the cubic (polynomial of order 3) approximation
of the form:
y = m1*x + m2*x^2 + m3*x^3 + b
You can adjust this formula to calculate other types of regression,
but in some cases it requires the adjustment of the output values and
other statistics.
or look at:
=trend
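Outside Excel, the same idea (fit a 6th-degree polynomial, then evaluate it at one x) can be sketched in Python with NumPy. This is a swapped-in technique, not an Excel feature, and the data points below are made up for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Made-up data: 20 points that happen to lie on y = 2 + 0.5*x - 0.1*x^2.
x = np.linspace(0.0, 10.0, 20)
y = 2.0 + 0.5 * x - 0.1 * x**2

# Least-squares fit of a degree-6 polynomial.
poly = Polynomial.fit(x, y, deg=6)

# Evaluate the fitted polynomial at the x value of interest.
estimate = float(poly(7.6))
```

Calling `poly(7.6)` plays the role of typing the x value "somewhere": the fitted object can be evaluated at any x without writing out the equation by hand.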
I want to represent each text-based item in my system as a vector in a vector space model. The value for each term can be negative or positive, reflecting the frequency of the term in the negative or positive class. A value of zero means neutral,
for example:
Item1 (-1,0,-5,4.5,2)
Item2 (2,6,0,-4,0.5)
My questions are:
1- How can I normalize my vectors to the range [0, 1] such that:
0.5 corresponds to zero before normalization,
values above 0.5 correspond to positive values, and
values below 0.5 correspond to negative values?
I want to know if there is a mathematical formula for this.
2- Will the choice of similarity measure change after the normalization? For example, can I still use cosine similarity?
3- Will it be difficult to perform dimensionality reduction after the normalization?
Thanks in advance
One solution could be to use MinMaxScaler, which scales each feature to the range [0, 1], and then divide each row by the sum of that row. In Python, using sklearn, you can do something like this:
from sklearn.preprocessing import MinMaxScaler, normalize
scaler = MinMaxScaler()
scaled_X = scaler.fit_transform(X)
normalized_X = normalize(scaled_X, norm='l1', axis=1, copy=True)
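Note that MinMaxScaler maps each column's minimum to 0 and its maximum to 1, so a zero in the original data does not generally land on 0.5. If that property matters, one formula that satisfies the three conditions in the question is x' = 0.5 + x / (2 * max|x|). The sketch below uses the two example items and takes the maximum over all values, which is an assumption; it could also be taken per column.

```python
import numpy as np

# The two example items from the question.
X = np.array([
    [-1.0, 0.0, -5.0, 4.5, 2.0],
    [2.0, 6.0, 0.0, -4.0, 0.5],
])

# Largest absolute value over the whole matrix (assumed to be non-zero).
max_abs = np.abs(X).max()

# Sign-preserving rescale: 0 -> 0.5, positives -> above 0.5, negatives -> below 0.5.
X_scaled = 0.5 + X / (2.0 * max_abs)
```

Because this transform shifts every value by 0.5, cosine similarity computed on `X_scaled` will differ from cosine similarity on the original vectors, so it is worth deciding which representation the similarity measure should use.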