I have a dataset with 200+ numerical variables (type: int). A few of those variables are actually categorical, with values such as (0,1) or (0,1,2,3,4).
I need to identify these categorical variables and dummify them.
Identifying and dummifying them takes a lot of time - is there any way to do it easily?
You can treat a variable as categorical based on the number of unique values it takes. For instance, if a variable only ever takes the values [-2, 4, 56], you could treat it as categorical.
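import pandas as pd
import numpy as np
# Candidate columns: everything except the identifier and the target.
col = [c for c in train.columns if c not in ['id', 'target']]
# Count the unique values in each candidate column.
numclasses = []
for c in col:
    numclasses.append(len(np.unique(train[c])))
threshold = 10
# Columns with fewer unique values than the threshold are treated as categorical.
categorical_variables = list(np.array(col)[np.array(numclasses) < threshold])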
Every unique value of every variable treated as categorical creates a new column. If you do not want too many dummy columns to be created later, use a small threshold.
Use the nunique() method to get the number of unique values in each column, then filter the columns by a threshold. Use your best judgement when choosing the threshold value. Then convert the selected features to the categorical type:
category_features = []
threshold = 10
for each in df.columns:
    if df[each].nunique() < threshold:
        category_features.append(each)
for each in category_features:
    df[each] = df[each].astype('category')
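Since the question also asks about dummifying, here is a minimal sketch of how the detection step could feed into pd.get_dummies. The toy dataframe, its column names and the threshold of 6 are assumptions made up for illustration, not taken from the question.

```python
import pandas as pd

# Hypothetical toy data: 'flag' and 'level' are low-cardinality ints, 'amount' is not.
df = pd.DataFrame({
    "flag": [0, 1, 0, 1, 1, 0],
    "level": [0, 1, 2, 3, 4, 2],
    "amount": [10, 25, 31, 47, 52, 68],
})

threshold = 6  # assumption: kept small because the toy frame only has 6 rows

# nunique() on the whole frame returns one count per column.
counts = df.nunique()
category_features = counts[counts < threshold].index.tolist()

# get_dummies expands only the listed columns; the others pass through unchanged.
dummies = pd.get_dummies(df, columns=category_features)

print(category_features)
print(dummies.columns.tolist())
```

On a real dataset you would probably want to exclude the id and target columns before counting, as the first answer does.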
I tried to apply the pandas get_dummies function to my dataset.
The problem is that the set of category values does not match between the train set and the valid set.
For example, a train set column has 5 distinct values, e.g. [1, 2, 3, 4, 5].
However, the valid set has just 3 distinct values, e.g. [1, 3, 5].
When I built the model using the train dataset, 5 dummies were created,
e.g. dum_1, dum_2, dum_3, dum_4, dum_5.
So if I just use the same function on the valid dataset, only 3 dummies are created,
e.g. dum_1, dum_3, dum_5.
As a result, my model cannot predict on the valid dataset.
How can I make the same dummies for the train and valid sets?
(It is not possible to concat the 2 datasets. Please suggest a method other than pd.concat.)
Also, if I add new columns to the valid set, I expect it will give a different result,
because the order of the dummies will not match between the train and valid sets.
Thanks.
All you need to do is create, in the validation dataset, the columns that are present in the training data but missing from the validation data:
missing_cols = [col for col in train.columns if col not in valid.columns]
for col in missing_cols:
    valid[col] = 0
These columns are appended at the end, so the column order changes. In the next step, rearrange the columns to match the training data:
valid = valid[train.columns]
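As an alternative sketch, DataFrame.reindex can add the missing dummy columns and fix the order in a single call. The small train and valid frames and the "dum" prefix below are assumptions that mirror the example in the question.

```python
import pandas as pd

# Hypothetical data mirroring the question: train has 5 categories, valid only 3.
train = pd.DataFrame({"cat": [1, 2, 3, 4, 5]})
valid = pd.DataFrame({"cat": [1, 3, 5]})

# dtype=int keeps the dummies as 0/1 integers.
train_dum = pd.get_dummies(train, columns=["cat"], prefix="dum", dtype=int)
valid_dum = pd.get_dummies(valid, columns=["cat"], prefix="dum", dtype=int)

# Align valid to train: missing columns are created and filled with 0,
# and the column order follows train_dum.
valid_dum = valid_dum.reindex(columns=train_dum.columns, fill_value=0)

print(valid_dum)
```

Note that reindex also drops any column that appears in valid but not in train, which is usually what you want, since the model never saw that category.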
I have a dataframe of categorical variables. I want to replace every value in a column with an arbitrary placeholder string if that value's category appears fewer than 100 times in the column.
So, for example, in the column color, if any color appears fewer than 100 times, I want it to be replaced by the string "base".
I tried the code below, along with different things I found on Stack Overflow, but it fails with the error shown underneath:
df['color'] = numpy.where(df.groupby("color").filter(lambda x: len(x) < 100), 'dummy', df['color'])
Operands could not be broadcast together with shapes (45638872,878) () (8765878782788,)
IIUC, you need this,
df.loc[df.groupby('color')['color'].transform('count')<100, 'color']= 'dummy'
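To see what the transform('count') line does, here is a small worked sketch. The toy data and the threshold of 3 (instead of 100) are assumptions for illustration, and it writes "base" as the question asked rather than 'dummy'.

```python
import pandas as pd

# Hypothetical toy data: red appears 5 times, blue 3 times, green once.
df = pd.DataFrame({"color": ["red"] * 5 + ["blue"] * 3 + ["green"] * 1})
min_count = 3  # stand-in for the 100 used in the question

# transform('count') returns a Series aligned with df: each row gets the
# size of the group it belongs to, so it can be used directly as a row mask.
counts = df.groupby("color")["color"].transform("count")
df.loc[counts < min_count, "color"] = "base"

print(df["color"].value_counts().to_dict())
```

The original numpy.where attempt fails because filter() returns a dataframe with fewer rows than df['color'], so the condition and the fallback values no longer have the same length.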
In my training set I have 24 feature vectors (FVs). Each FV contains 2 lists. When I try to fit this with model = LogisticRegression() or model = KNeighborsClassifier(n_neighbors=k), I get this error: ValueError: setting an array element with a sequence.
In my dataframe, each row represents one FV. There are 3 columns. The first column contains a list of an individual's heart rate, the second a list of the corresponding activity data, and the third the target. Visually, it looks something like this:
HR                        ACT                       Target
[0.5018, 0.5106, 0.4872]  [0.1390, 0.1709, 0.0886]  1
[0.4931, 0.5171, 0.5514]  [0.2423, 0.2795, 0.2232]  0
Should I:
1. Join both lists to form one long FV
2. Expand both lists such that each column represents one value. In other words, if there are 5 items in HR and ACT data for a FV, the new dataframe would have 10 columns for features and 1 for Target.
How do logistic regression and KNN handle input data? I understand that logistic regression combines the inputs linearly using weights or coefficient values, but I am not sure what that means when it comes to lists vs. dataframe columns. Does it automatically convert the corresponding values of the dataframe columns to a list before transforming? Is there a difference between methods 1 and 2?
Additionally, if a long list is required, should I order it as [HR,HR,HR,ACT,ACT,ACT] or [HR,ACT,HR,ACT,HR,ACT]?
You should go with option 2:
Expand both lists such that each column represents one value. In other words, if there are 5 items in HR and ACT data for a FV, the new dataframe would have 10 columns for features and 1 for Target.
You should then select the feature columns from the dataframe and pass them as X, and the target column as y, to the model's fit function.
Scikit-learn's models accept inputs of shape [n_samples, n_features]. After following the second approach you proposed, your training dataframe will be 2D, with shape [n_samples, 10] in the 5-item case.
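A minimal sketch of that expansion, using the two rows shown in the question (3 items per list, so 6 feature columns rather than 10). The HR_ and ACT_ column prefixes are names chosen here for illustration.

```python
import pandas as pd

# The two example rows from the question: each cell in HR and ACT holds a list.
df = pd.DataFrame({
    "HR": [[0.5018, 0.5106, 0.4872], [0.4931, 0.5171, 0.5514]],
    "ACT": [[0.1390, 0.1709, 0.0886], [0.2423, 0.2795, 0.2232]],
    "Target": [1, 0],
})

# Turn each list column into one numeric column per list position.
hr = pd.DataFrame(df["HR"].tolist(), index=df.index).add_prefix("HR_")
act = pd.DataFrame(df["ACT"].tolist(), index=df.index).add_prefix("ACT_")

# X has shape [n_samples, n_features]; y is the 1-D target.
X = pd.concat([hr, act], axis=1)
y = df["Target"]

print(X.shape)
print(X.columns.tolist())
```

X and y can then be passed to model.fit(X, y). As far as I know, the order of the feature columns ([HR,HR,HR,ACT,ACT,ACT] versus interleaved) should not matter for logistic regression or KNN, as long as the same order is used for every row and again at prediction time.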
I have a uniform distribution in a pandas dataframe column with a few NaN values I'd like to replace.
Since the data is uniformly distributed, I decided that I would like to fill the null values with random uniform samples drawn from a range of the column's min and max values. I used the following code to get the random uniform sample:
df_copy['ep'] = df_copy['ep'].fillna(value=np.random.uniform(3, 331))
Of course, pd.DataFrame.fillna() replaces all existing NaNs with the same value, whereas I would like each NaN to get a different value. I assume that a for loop could get the job done, but I am unsure how to write such a loop to handle only the NaN values. Thanks for the help!
It looks like you are doing this on a Series (column), but the same approach would work on a DataFrame:
Sample Data:
series = pd.Series(range(100))
series.loc[2] = np.nan
series.loc[10:15] = np.nan
Solution:
series.mask(series.isnull(), np.random.uniform(3, 331, size=series.shape))
Use boolean indexing with DataFrame.loc:
m = df_copy['ep'].isna()
df_copy.loc[m, 'ep'] = np.random.uniform(3, 331, size=m.sum())
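Here is a slightly extended sketch of the boolean-indexing approach that takes the bounds from the column itself instead of hard-coding 3 and 331. The toy values and the seed are assumptions, and it uses NumPy's Generator API rather than np.random.uniform.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical column with a few missing values.
df_copy = pd.DataFrame({"ep": [3.0, np.nan, 120.5, np.nan, 331.0, np.nan]})

# min() and max() skip NaN by default, so they give the observed range.
low, high = df_copy["ep"].min(), df_copy["ep"].max()

# Draw exactly one sample per missing value and assign only to those rows.
m = df_copy["ep"].isna()
df_copy.loc[m, "ep"] = rng.uniform(low, high, size=m.sum())

print(df_copy)
```

Because size equals the number of NaNs, every missing cell receives its own independent draw.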
I have a dataframe with several currently-empty columns. I want a fraction of these filled with data drawn from a normal distribution, while all the rest are left blank. So, for example, if 60% of the elements should be blank, then 60% would be, while the other 40% would be filled. I already have the normal distribution, via numpy, but I'm trying to figure out how to choose random rows to fill. Currently, the only way I can think of involves FOR loops, and I would rather avoid that.
Does anyone have any ideas for how I could fill empty elements of a dataframe at random? I have a bit of the code below, for the random numbers.
data.loc[data['ColumnA'] == 'B', 'ColumnC'] = np.random.normal(1000, 500, rowsB).astype('int64')
piRSquared's advice is good. We are left guessing what to solve.
That said, having just looked through some of the latest unanswered pandas questions, there are worse ones.
import pandas as pd
import numpy as np
# Some redundancy here: I build an empty dataframe to pretend I start, like you, with an existing DataFrame.
df = pd.DataFrame(index=range(11), columns=list('abcdefg'))
num_cells = np.prod(df.shape)
# Make a 1-dim array with the numbers from 1 to the number of cells.
arr = np.arange(1, num_cells + 1)
# In-place shuffle: this is the key randomization operation.
np.random.shuffle(arr)
arr = arr.reshape(df.shape)
# Place the shuffled values, normalized by the number of cells, into the dataframe.
df = pd.DataFrame(index=df.index, columns=df.columns, data=arr / float(num_cells))
# Keep 40% of the cells as ones and set the other 60% to NaN.
df = pd.DataFrame(np.where(df > 0.6, 1.0, np.nan), index=df.index, columns=df.columns)
# Now sample a full set from the normal distribution.
# Multiplying by NaN nullifies the sampled value, while multiplying by 1 retains it.
df * np.random.normal(1000, 500, df.shape)
Thus you are left with a random 40% of the cells containing a draw from your normal distribution.
If your dataframe were large, you could rely on the uniform rand() function to put approximately the right fraction of cells above the threshold. Here I didn't do that; instead I determined explicitly how many cells fall above and below the threshold.
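For the large-dataframe case mentioned above, a rough sketch could draw a uniform random mask directly. The frame size, the seed and the names fill_fraction, mask and filled are assumptions for illustration, and the filled fraction is only approximately 40%, not exact.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility

# Hypothetical large, empty dataframe.
df = pd.DataFrame(index=range(1000), columns=list("abcdefg"), dtype=float)

fill_fraction = 0.4  # fraction of cells that should receive a value

# Each cell is selected independently with probability fill_fraction.
mask = rng.random(df.shape) < fill_fraction

# Sample a full grid from the normal distribution, then keep only the masked cells.
samples = rng.normal(1000, 500, df.shape)
filled = pd.DataFrame(np.where(mask, samples, np.nan),
                      index=df.index, columns=df.columns)

print(filled.notna().to_numpy().mean())
```

If the fraction has to be exact, the shuffle-based approach above is the safer choice.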