Create dummies from a column for a subset of data which doesn't contain all the category values in that column - python-3.x

I am handling a subset of a large data set.
There is a column named "type" in the dataframe, which is expected to take values from [1,2,3,4].
In a certain subset, I find that the "type" column only contains some of those values, e.g. [1,4]:
In [1]: df
Out[1]:
   type
0     1
1     4
When I create dummies from column "type" on that subset, it turns out like this:
In [2]: import pandas as pd
In [3]: pd.get_dummies(df["type"], prefix="type")
Out[3]:
   type_1  type_4
0       1       0
1       0       1
It doesn't have the columns "type_2" and "type_3". What I want is:
   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1
Is there a solution for this?

What you need to do is make the column 'type' into a pd.Categorical and specify the full set of categories:
pd.get_dummies(pd.Categorical(df.type, [1, 2, 3, 4]), prefix='type')
   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1

Another solution with reindex (reindex_axis on older pandas; it has since been removed) and add_prefix:
df1 = (pd.get_dummies(df["type"])
         .reindex([1, 2, 3, 4], axis=1, fill_value=0)
         .add_prefix('type'))
print(df1)
   type1  type2  type3  type4
0      1      0      0      0
1      0      0      0      1
Or a categorical solution (astype('category', categories=[...]) worked on older pandas; current versions use CategoricalDtype):
df1 = pd.get_dummies(df["type"].astype(pd.CategoricalDtype([1, 2, 3, 4])), prefix='type')
print(df1)
   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1

Since you tagged your post as one-hot-encoding, you may find sklearn module's OneHotEncoder useful, in addition to pure Pandas solutions:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
# sample data
df = pd.DataFrame({'type': [1, 4]})
n_vals = 5
# one-hot encoding; categories= replaced the old n_values= argument,
# and sparse_output= was named sparse= before scikit-learn 1.2
encoder = OneHotEncoder(categories=[list(range(n_vals))], sparse_output=False, dtype=int)
data = encoder.fit_transform(df.type.values.reshape(-1, 1))
# encoded data frame
newdf = pd.DataFrame(data, columns=['type_{}'.format(x) for x in range(n_vals)])
print(newdf)
   type_0  type_1  type_2  type_3  type_4
0       0       1       0       0       0
1       0       0       0       0       1
One advantage of using this approach is that OneHotEncoder easily produces sparse vectors for very large category sets. (Sparse output is the encoder's default; it is only the explicit dense flag in the declaration above that turns it off.)
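As a sketch of that sparse path (assuming scikit-learn ≥ 0.20 for the categories= argument): leaving the output sparse avoids materializing a dense matrix at all.

```python
import numpy as np
from scipy.sparse import issparse
from sklearn.preprocessing import OneHotEncoder

# Listing the categories explicitly guarantees one column per category,
# even for values absent from this subset; sparse output is the default.
enc = OneHotEncoder(categories=[list(range(5))])
matrix = enc.fit_transform(np.array([[1], [4]]))
print(issparse(matrix))  # True
print(matrix.shape)      # (2, 5)
```

Calling matrix.toarray() recovers the dense 0/1 array when needed.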

Related

Getting Dummy Back to Categorical

I have a df called X like this:
Index Class Family
1 Mid 12
2 Low 6
3 High 5
4 Low 2
I converted this to dummy variables using the code below:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
ohe = OneHotEncoder()
X_object = X.select_dtypes('object')
ohe.fit(X_object)
codes = ohe.transform(X_object).toarray()
feature_names = ohe.get_feature_names_out(['V1', 'V2'])  # get_feature_names on older scikit-learn
X = pd.concat([X.select_dtypes(exclude='object'),
               pd.DataFrame(codes, columns=feature_names).astype(int)], axis=1)
Resultant df is like:
V1_Mid V1_Low V1_High V2_12 V2_6 V2_5 V2_2
1 0 0 1 0 0 0
..and so on
Question: How do I convert my resultant df back to the original df?
I have seen this but it gives me NameError: name 'Series' is not defined.
First we can regroup each original column from your resultant df into the original column names as the first level of a column multi-index:
>>> df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('_', n=1).map(tuple))
>>> df = df.rename(columns={'V1': 'Class', 'V2': 'Family'}, level=0)
>>> df
Class Family
Mid Low High 12 6 5 2
0 1 0 0 1 0 0 0
Now we see the second-level of columns are the values. Thus, within each top-level we want to get the column name that has a 1, knowing all the other entries are 0. This can be done with idxmax():
>>> orig_df = pd.concat({col: df[col].idxmax(axis='columns') for col in df.columns.levels[0]}, axis='columns')
>>> orig_df
Class Family
0 Mid 12
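For a single dummy-encoded column, the idxmax step can be checked in isolation; a minimal sketch (the labels here are illustrative, not from the question):

```python
import pandas as pd

s = pd.Series(["Mid", "Low", "High", "Low"], name="Class")
dummies = pd.get_dummies(s)
# Each row contains exactly one 1/True, so idxmax recovers the label.
recovered = dummies.idxmax(axis=1)
print(recovered.tolist())  # ['Mid', 'Low', 'High', 'Low']
```

On pandas ≥ 1.5 there is also a built-in inverse, pd.from_dummies, worth checking before rolling your own.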
An even simpler way is to stick with pandas.
df = pd.DataFrame({"Index":[1,2,3,4],"Class":["Mid","Low","High","Low"],"Family":[12,6,5,2]})
# Combine features in new column
df["combined"] = list(zip(df["Class"], df["Family"]))
print(df)
Out:
Index Class Family combined
0 1 Mid 12 (Mid, 12)
1 2 Low 6 (Low, 6)
2 3 High 5 (High, 5)
3 4 Low 2 (Low, 2)
You can get the one hot encoding using pandas directly.
one_hot = pd.get_dummies(df["combined"])
print(one_hot)
Out:
(High, 5) (Low, 2) (Low, 6) (Mid, 12)
0 0 0 0 1
1 0 0 1 0
2 1 0 0 0
3 0 1 0 0
Then, to get back, you can just take a dummy column's name and select the rows of the original dataframe with the same tuple. (isin is used here because comparing a Series directly against a tuple is interpreted as an element-wise comparison.)
print(df[df["combined"].isin([one_hot.columns[0]])])
Out:
Index Class Family combined
2 3 High 5 (High, 5)
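Condensed into a self-contained sketch (isin sidesteps the element-wise Series-vs-tuple comparison):

```python
import pandas as pd

df = pd.DataFrame({"Class": ["Mid", "Low", "High", "Low"],
                   "Family": [12, 6, 5, 2]})
df["combined"] = list(zip(df["Class"], df["Family"]))
one_hot = pd.get_dummies(df["combined"])
# Dummy column labels are the tuples themselves (sorted), so the first
# column here is ('High', 5), and recovery is a membership lookup.
row = df[df["combined"].isin([one_hot.columns[0]])]
print(row["Class"].tolist())  # ['High']
```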

How to check if column is binary? (Pandas)

How can I (efficiently!) check whether a column is binary?
"col" "col2"
0 0 1
1 0 0
2 0 0
3 0 0
4 0 1
Also, there might be a problem with columns that aren't meant to be binary,
but only contain zeros.
(I thought of using a list of their names, filled after each column is added to the DF,
but is there a way to directly mark a column as "binary" during creation?)
The purpose is feature scaling for machine learning (binary columns shouldn't be scaled).
If you want to filter the names of columns containing only 0 or 1 values:
c = df.columns[df.isin([0,1]).all()]
print (c)
Index(['col', 'col2'], dtype='object')
If you need to filter the columns themselves:
df1 = df.loc[:, df.isin([0,1]).all()]
print (df1)
col col2
0 0 1
1 0 0
2 0 0
3 0 0
4 0 1
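A sketch of that check side by side with a stricter variant, illustrating the asker's all-zero caveat (the strict variant is my addition, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({"col": [0, 0, 0, 0, 0],    # all zeros: 1 never occurs
                   "col2": [1, 0, 0, 0, 1],
                   "val": [3, 1, 2, 5, 4]})
# Loose check: every value is drawn from {0, 1} (all-zero columns pass).
loose = list(df.columns[df.isin([0, 1]).all()])
# Strict check: both 0 and 1 must actually occur in the column.
strict = [c for c in df.columns if set(df[c].unique()) == {0, 1}]
print(loose)   # ['col', 'col2']
print(strict)  # ['col2']
```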
you can use this:
pd.unique(df[['col', 'col2']].values.ravel('K'))
and it returns:
array([0, 1], dtype=int64)
or you can also apply pd.unique to each column.
Here is what I use; it also covers corner cases with mixed string/numeric types:
import numpy as np
import pandas as pd
def checkBinary(ser, dropna=False):
    try:
        if dropna:
            ser = pd.to_numeric(ser.dropna(), errors="raise")  # errors must be raised, not coerced
        else:
            ser = pd.to_numeric(ser, errors="raise")
    except (ValueError, TypeError):
        return False
    return {0, 1} == set(pd.unique(ser))
ser = pd.Series(["0",1,"1.000", np.nan])
checkBinary(ser, dropna = True)
>> True
ser = pd.Series(["0",0,"0.000"])
checkBinary(ser)
>> False

Is there anyway to make more than one dummies variable at a time? [duplicate]

How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
With pandas 0.19, you can do that in a single line :
pd.get_dummies(data=df, columns=['A', 'B'])
The columns argument specifies which columns to one-hot encode.
>>> df
A B C
0 a c 1
1 b c 2
2 a b 3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
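A runnable version of the same call (current pandas returns bool dummy columns by default, so dtype=int is passed here to reproduce the 0/1 integers shown above):

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["c", "c", "b"], "C": [1, 2, 3]})
# Only the listed columns are encoded; C passes through untouched.
out = pd.get_dummies(df, columns=["A", "B"], dtype=int)
print(out.columns.tolist())  # ['C', 'A_a', 'A_b', 'B_b', 'B_c']
```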
Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series, and see below for the workaround):
In [1]: df = DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
...: 'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Workaround for pandas < 0.15.0
You can do it for each column separately and then concat the results:
In [111]: df
Out[111]:
A B
0 a x
1 a y
2 b z
3 b x
4 c x
5 a y
6 b y
7 c z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
A B
a b c x y z
0 1 0 0 1 0 0
1 1 0 0 0 1 0
2 0 1 0 0 0 1
3 0 1 0 1 0 0
4 0 0 1 1 0 0
5 1 0 0 0 1 0
6 0 1 0 0 1 0
7 0 0 1 0 0 1
If you don't want the multi-index column, then remove the keys=.. from the concat function call.
Somebody may have something more clever, but here are two approaches. Assuming you have a dataframe named df with columns 'Name' and 'Year' you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
...: dummies = pd.get_dummies(df[column])
...: df[dummies.columns] = dummies
Another idea would be to use the patsy package, which is designed to construct data matrices from R-type formulas.
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
Unless I don't understand the question, it is supported natively in get_dummies by passing the columns argument.
The simple trick I am currently using is a for-loop.
First separate the categorical data from the DataFrame using select_dtypes(include="object"),
then apply get_dummies to each column iteratively in a for loop,
as shown in the code below:
train_cate = train_data.select_dtypes(include="object")
test_cate = test_data.select_dtypes(include="object")
# vectorize categorical data
for col in train_cate:
    cate1 = pd.get_dummies(train_cate[col])
    train_cate[cate1.columns] = cate1
    cate2 = pd.get_dummies(test_cate[col])
    test_cate[cate2.columns] = cate2

Faster way to count number of timestamps before another timestamp

I have two dataframes, "train" and "log". "log" has a datetime column "time1", while "train" has a datetime column "time2". For every row in "train", I want to count the rows of "log" (with the same user_id) whose "time1" falls before that row's "time2".
I already tried the apply method with dataframe.
def log_count(row):
    return sum((log['user_id'] == row['user_id']) & (log['time1'] < row['time2']))
train.apply(log_count, axis=1)
It is taking very long with this approach.
Since you want to do this once for each (paired) user_id group, you could do the following:
Create a column called is_log which is 1 in log and 0 in train:
log['is_log'] = 1
train['is_log'] = 0
The is_log column will be used to keep track of whether or not a row comes from log or train.
Concatenate the log and train DataFrames:
combined = pd.concat(
[log.rename(columns=dict(time1="time")), train.rename(columns=dict(time2="time"))],
axis=0,
ignore_index=True,
sort=False,
)
Sort the combined DataFrame by user_id and time:
combined = combined.sort_values(by=["user_id", "time"])
So now combined looks something like this:
time user_id is_log
6 2000-01-17 0 0
0 2000-03-13 0 1
1 2000-06-08 0 1
7 2000-06-25 0 0
4 2000-07-09 0 1
8 2000-07-18 0 0
10 2000-03-13 1 0
5 2000-04-16 1 0
3 2000-08-04 1 1
9 2000-08-17 1 0
2 2000-10-20 1 1
Now the count that you are looking for can be expressed as a cumulative sum of the is_log column, grouped by user_id:
combined["count"] = combined.groupby("user_id")["is_log"].cumsum()
train = combined.loc[combined["is_log"] == 0]
This is the main idea: Counting the number of 1s in the is_log column is equivalent to counting the number of times in log which come before each time in train.
For example,
import numpy as np
import pandas as pd
np.random.seed(2019)
def random_dates(N):
    return np.datetime64("2000-01-01") + np.random.randint(
        365, size=N
    ) * np.timedelta64(1, "D")
N = 5
log = pd.DataFrame({"time1": random_dates(N), "user_id": np.random.randint(2, size=N)})
train = pd.DataFrame(
{
"time2": np.r_[random_dates(N), log.loc[0, "time1"]],
"user_id": np.random.randint(2, size=N + 1),
}
)
log["is_log"] = 1
train["is_log"] = 0
combined = pd.concat(
[log.rename(columns=dict(time1="time")), train.rename(columns=dict(time2="time"))],
axis=0,
ignore_index=True,
sort=False,
)
combined = combined.sort_values(by=["user_id", "time"])
combined["count"] = combined.groupby("user_id")["is_log"].cumsum()
train = combined.loc[combined["is_log"] == 0]
print(log)
# time1 user_id is_log
# 0 2000-03-13 0 1
# 1 2000-06-08 0 1
# 2 2000-10-20 1 1
# 3 2000-08-04 1 1
# 4 2000-07-09 0 1
print(train)
yields
time user_id is_log count
6 2000-01-17 0 0 0
7 2000-06-25 0 0 2
8 2000-07-18 0 0 3
10 2000-03-13 1 0 0
5 2000-04-16 1 0 0
9 2000-08-17 1 0 1
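For a single user (or inside a groupby), the same strictly-before counts can also be read off np.searchsorted against the sorted log times; a sketch using user 0's dates from the example above:

```python
import numpy as np

# user 0's log times and train times, taken from the example output
log_times = np.array(["2000-03-13", "2000-06-08", "2000-07-09"], dtype="datetime64[D]")
train_times = np.array(["2000-01-17", "2000-06-25", "2000-07-18"], dtype="datetime64[D]")
# side="left" counts log entries strictly earlier than each train time
counts = np.searchsorted(np.sort(log_times), train_times, side="left")
print(counts.tolist())  # [0, 2, 3]
```

The result matches the cumsum-based counts for user 0 above.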

Pandas DataFrame: create a matrix-like with 0 and 1

I have to create a matrix-like structure of 0s and 1s. How can I create something like that?
This is my DataFrame:
I want to put a 1 at the intersection where the value of df['luogo'] (e.g. 'sala') matches the column name (e.g. the column 'sala').
This is my try:
for head in dataframe.columns:
    for i in dataframe['luogo']:
        if i == head:
            dataframe[head] = 1
        else:
            dataframe[head] = 0
Sorry for the Italian dataframe.
You are probably looking for pandas.get_dummies(..) [pandas-doc]. For a given dataframe df:
>>> df
luogo
0 sala
1 scuola
2 teatro
3 sala
We get:
>>> pd.get_dummies(df['luogo'])
sala scuola teatro
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
You thus can join this with your original dataframe with:
>>> df.join(pd.get_dummies(df['luogo']))
luogo sala scuola teatro
0 sala 1 0 0
1 scuola 0 1 0
2 teatro 0 0 1
3 sala 1 0 0
This thus constructs a "one hot encoding" [wiki] of the values in your original dataframe.
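The join above, condensed into a self-contained snippet (dtype=int keeps the 0/1 integers; newer pandas otherwise returns bool columns):

```python
import pandas as pd

df = pd.DataFrame({"luogo": ["sala", "scuola", "teatro", "sala"]})
# One indicator column per distinct value, joined back onto the original.
out = df.join(pd.get_dummies(df["luogo"], dtype=int))
print(out.loc[0, "sala"])  # 1
```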
