How to apply the pandas get_dummies function to a validation data set? - python-3.x

I tried to apply the pandas get_dummies function to my dataset.
The problem is that the number of category values does not match between the train set and the validation set.
For example, the train set column has 5 kinds of values, e.g. [1, 2, 3, 4, 5].
However, the validation set has just 3 kinds of values, e.g. [1, 3, 5].
When I built the model using the train dataset, 5 dummies were created,
e.g. dum_1, dum_2, dum_3, dum_4, dum_5.
So if I just use the same function on the validation data set, only 3 dummies will be created,
e.g. dum_1, dum_2, dum_3.
With that mismatch, it is not possible to run my model's predictions on the validation data set.
How can I make the same dummies for the train and validation sets?
(It is not possible to concat the 2 datasets. Please suggest a method other than pd.concat.)
Also, if I simply add new columns to the validation set, I expect it will give a different result,
because the dummy column order would not match between the train and validation sets.
Thanks.

All you need to do is:
Create the columns in the validation dataset which are present in the training data but missing in the validation data:
missing_cols = [col for col in train.columns if col not in valid.columns]
for col in missing_cols:
    valid[col] = 0
Now, these columns are appended at the end, so the column order no longer matches the training set. Thus, in the next step we rearrange the columns as below:
valid = valid[train.columns]
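Putting the two steps together, here is a minimal runnable sketch; the data and the single categorical column name cat are made up for illustration:
import pandas as pd

# hypothetical data: values 2 and 4 never appear in the validation set
train_raw = pd.DataFrame({'cat': [1, 2, 3, 4, 5]})
valid_raw = pd.DataFrame({'cat': [1, 3, 5]})

train = pd.get_dummies(train_raw, columns=['cat'])   # cat_1 ... cat_5
valid = pd.get_dummies(valid_raw, columns=['cat'])   # cat_1, cat_3, cat_5

# add the dummy columns that exist in train but not in valid, filled with 0
missing_cols = [col for col in train.columns if col not in valid.columns]
for col in missing_cols:
    valid[col] = 0

# reorder so valid has exactly the same columns, in the same order, as train
valid = valid[train.columns]
An equivalent shortcut for both steps is valid.reindex(columns=train.columns, fill_value=0), which adds any missing dummy columns as zeros and puts everything in the training column order.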

Related

Deleting a word in column based on frequencies

I have an NLP project where I would like to remove the words that appear only once in the keywords. That is to say, for each row I have a list of keywords and their frequencies.
I would like something like:
if the frequency of the word in the whole ['keywords'] column == 1, then replace it with "".
I cannot test word by word. So my idea was to create a list with all the words, remove the duplicates, then for each word in this list compute the total count and delete it if needed. But I have no idea how to do that.
Any ideas? Thanks!
Here's what my data looks like:
sample.head(4)
ID keywords age sex
0 1 fibre:16;quoi:1;dangers:1;combien:1;hightech:1... 62 F
1 2 restaurant:1;marrakech.shtml:1 35 M
2 3 payer:1;faq:1;taxe:1;habitation:1;macron:1;qui... 45 F
3 4 rigaud:3;laurent:3;photo:11;profile:8;photopro... 46 F
To add on to what @jpl mentioned about scikit-learn's CountVectorizer: there is an option, min_df, that does exactly what you want, provided you can get your data into the right format. Here's an example:
from sklearn.feature_extraction.text import CountVectorizer
# assuming you want the token to appear in >= 2 documents
vectorizer = CountVectorizer(min_df=2)
documents = ['hello there', 'hello']
X = vectorizer.fit_transform(documents)
This gives you:
# Notice the dimensions of our array – 2 documents by 1 token
>>> X.shape
(2, 1)
# Here is a count of how many times the tokens meeting the inclusion
# criteria are observed in each document (as you see, "hello" is seen once
# in each document)
>>> X.toarray()
array([[1],
[1]])
# this is the entire vocabulary our vectorizer knows – see how "there" is excluded?
>>> vectorizer.vocabulary_
{'hello': 0}
Your representation makes that difficult.
You should build a dataframe where each column is a word; then you can easily use pandas operations like sum to do whatever you want.
However, this will lead to a very sparse dataframe, which is inefficient.
Many libraries, e.g. scikit-learn's CountVectorizer, allow you to do what you want efficiently.
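If you prefer to stay in pandas, a rough sketch of that word-per-column idea could look like the following; the two sample rows are made up, but they follow the same word:count;word:count format as the question:
import pandas as pd

# hypothetical rows in the same "word:count;word:count" format as the question
sample = pd.DataFrame({'keywords': ['fibre:16;quoi:1;dangers:1',
                                    'restaurant:1;quoi:1']})

def parse(s):
    # turn "word:count;word:count" into a dict {word: count}
    return {w: int(c) for w, c in (item.split(':') for item in s.split(';'))}

# one column per word, one row per document, 0 where the word is absent
counts = pd.DataFrame([parse(s) for s in sample['keywords']]).fillna(0)

# words whose total frequency over the whole column is exactly 1
totals = counts.sum()
rare = set(totals[totals == 1].index)

# rebuild the keywords column without the rare words
sample['keywords'] = [
    ';'.join(f'{w}:{int(c)}' for w, c in row.items() if c > 0 and w not in rare)
    for _, row in counts.iterrows()
]
As the answer above notes, for large vocabularies this dense frame gets wasteful, which is where CountVectorizer's sparse output wins.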

Input formatting for models such as logistic regression and KNN for Python

In my training set I have 24 Feature Vectors (FV). Each FV contains 2 lists. When I try to fit this with model = LogisticRegression() or model = KNeighborsClassifier(n_neighbors=k) I get this error: ValueError: setting an array element with a sequence.
In my dataframe, each row represents one FV. There are 3 columns. The first column contains a list of an individual's heart-rate values, the second a list of the corresponding activity data, and the third the target. Visually, it looks something like this:
HR ACT Target
[0.5018, 0.5106, 0.4872] [0.1390, 0.1709, 0.0886] 1
[0.4931, 0.5171, 0.5514] [0.2423, 0.2795, 0.2232] 0
Should I:
1. Join both lists to form one long FV, or
2. Expand both lists so that each column represents one value. In other words, if there are 5 items in the HR and ACT data for an FV, the new dataframe would have 10 columns for features and 1 for the Target.
How do Logistic Regression and KNN handle input data? I understand that logistic regression combines the inputs linearly using weights or coefficient values. But I am not sure what that means when it comes to lists vs. dataframe columns. Does it automatically convert corresponding values of dataframe columns to a list before transforming? Is there a difference between method 1 and 2?
Additionally, if one long list is required, should it be ordered as [HR, HR, HR, ACT, ACT, ACT] or [HR, ACT, HR, ACT, HR, ACT]?
You should go with option 2:
Expand both lists so that each column represents one value. In other words, if there are 5 items in the HR and ACT data for an FV, the new dataframe would have 10 columns for features and 1 for the Target.
You should then select the feature columns from the dataframe and pass them as X, and the target column as y, to the model's fit function.
Sklearn's models accept inputs of shape [n_samples, n_features]; after following the 2nd solution you proposed, your training dataframe will be 2D with shape [n_samples, 10].
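A small sketch of option 2, using the column names HR, ACT and Target from the question and two made-up rows:
import pandas as pd
from sklearn.linear_model import LogisticRegression

# hypothetical frame in the same shape as the question: lists inside HR and ACT
df = pd.DataFrame({
    'HR':     [[0.5018, 0.5106, 0.4872], [0.4931, 0.5171, 0.5514]],
    'ACT':    [[0.1390, 0.1709, 0.0886], [0.2423, 0.2795, 0.2232]],
    'Target': [1, 0],
})

# expand each list column into one column per element: hr_0, hr_1, ..., act_0, ...
hr = pd.DataFrame(df['HR'].tolist()).add_prefix('hr_')
act = pd.DataFrame(df['ACT'].tolist()).add_prefix('act_')

X = pd.concat([hr, act], axis=1)   # shape: [n_samples, n_features]
y = df['Target']

model = LogisticRegression()
model.fit(X, y)
Because each element gets its own column, ordering the features as [HR, HR, HR, ACT, ACT, ACT] or interleaved makes no difference to the fit, as long as the order is the same for every row.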

Creating a Multilevel dataframe row by row

So, I have set up some functions that retrieve data, and my idea is to create a DataFrame with the following structure:
A multi-level index with 3 levels named 'Date', 'Competition', 'Match'.
Multi-level columns with 2 levels: 2 values in the upper level and the same 8 column names under each one.
My guess is that the best approach is to loop to get every row and save it in a list, so that once finished you only have to create the dataframe, but I'm having difficulties with how to actually do it.
To create the frame for the dataframe I do the following:
indx=['pts','gfa','gco','cs','fts','bts','o25%','po25/bts']
findx=[('h/a stats',x) for x in indx]+[('total stats',y) for y in indx]
index=pd.MultiIndex.from_tuples(findx, names=['tipo', 'stat'])
index2=pd.MultiIndex.from_tuples([('date','competition','match')])
If I just do
fframe=pd.DataFrame(index=index2,columns=index)
>>[1 rows x 16 columns]
which is OK, the frame has the desired structure. But if I try adding a dummy row from the start to check that it works,
r=['11-12-11','ARG1','Blois v Gries',1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
fframe=pd.DataFrame(r,index=index2,columns=index)
>>ValueError: Shape of passed values is (1, 19), indices imply (16, 1)
What am I missing? Why doesn't it populate the dataframe? How should this be accomplished?
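I cannot verify the original intent, but one sketch that makes the shapes line up is to keep the three identifying values as the index entry and pass the 16 stats as a single row of data:
import pandas as pd

indx = ['pts', 'gfa', 'gco', 'cs', 'fts', 'bts', 'o25%', 'po25/bts']
findx = [('h/a stats', x) for x in indx] + [('total stats', y) for y in indx]
columns = pd.MultiIndex.from_tuples(findx, names=['tipo', 'stat'])

r = ['11-12-11', 'ARG1', 'Blois v Gries',
     1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

# the first three values form the (date, competition, match) index entry,
# the remaining 16 form one row of data, so both sides imply a (1, 16) shape
index = pd.MultiIndex.from_tuples([tuple(r[:3])],
                                  names=['date', 'competition', 'match'])
fframe = pd.DataFrame([r[3:]], index=index, columns=columns)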

How to identify the categorical variables in the 200+ numerical variables?

I have a dataset which has 200+ numerical variables (type: int). Among those variables, a few are categorical variables with values like (0, 1), (0, 1, 2, 3, 4), etc.
I need to identify these categorical variables and dummify them.
Identifying and dummifying them takes a lot of time - is there any way to do it easily?
You could decide that some variables are categorical, or treat them as categorical, based on the number of their unique values. For instance, if a variable's only unique values are [-2, 4, 56], you could treat it as categorical.
import pandas as pd
import numpy as np
col = [c for c in train.columns if c not in ['id', 'target']]
numclasses = []
for c in col:
    numclasses.append(len(np.unique(train[[c]])))
threshold = 10
categorical_variables = list(np.array(col)[np.array(numclasses) < threshold])
Every unique value in a variable treated as categorical will create a new dummy column later. If you don't want too many dummy columns to be created, use a small threshold.
Use the nunique() function to get the number of unique values in each column and then filter the columns. Use your best judgement to initialize the threshold value. Then convert the features to the categorical type:
category_features = []
threshold = 10
for each in df.columns:
    if df[each].nunique() < threshold:
        category_features.append(each)
for each in category_features:
    df[each] = df[each].astype('category')
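If you also want the dummy columns, one short follow-up sketch is to hand the detected columns straight to pd.get_dummies:
# assuming category_features from above holds the detected columns
df = pd.get_dummies(df, columns=category_features)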

built in method like pandas.sub() that does row wise subtraction

I am looking for a simple method for doing row-wise subtraction on a pandas df. The closest that I can find is df.shift(1), which only works on datetime. So if I have a dataframe
df['col'] = [1, 2, 3, 4, 5]
is there a built-in method that will allow me to do element-wise subtraction so that it returns the following, by subtracting from every element the one on its left? The first element stays as is.
sd['col'] = [1, 1, 1, 1, 1]
Is there an existing method that does this, or do I have to code it myself?
In your comment you write
First value is correct. The subsequent values are to be subtracted.
So, say you want to place the differences in a column called diff. Then you could do
df['diff'] = df['col'].diff()
(using pd.Series.diff), which places the difference in every entry except the first, which becomes NaN. You can easily mend this with
df['diff'].values[0] = df['col'].values[0]
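A compact way to get the same result in one pass, shown here with the data from the question, is to fill the first NaN produced by diff() with the original value:
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4, 5]})

# difference with the previous row; the first entry has no predecessor, so keep it as-is
df['diff'] = df['col'].diff().fillna(df['col'])
# df['diff'] is now [1.0, 1.0, 1.0, 1.0, 1.0]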
