Count items across all columns using a pandas method - scikit-learn

I have this dataframe, and I can get the count of each item per row using the vectorizer. But this only works for a single column (e.g. col1). How do I apply it to the entire dataframe (all 3 columns)?
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

shopping_list = [
    ["Apple", "Bread", "Fridge"],
    ["Rice", "Bread", "Milk"],
    ["Apple", "Rice", "Bread"],
    ["Rice", "Milk", "Milk"],
    ["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)
df.columns = ['col1', 'col2', 'col3']

CV = CountVectorizer()
cv_matrix = CV.fit_transform(df['col1'].values)
ndf = pd.SparseDataFrame(cv_matrix)
ndf.columns = CV.get_feature_names()
X = ndf.fillna("0")
The results are correct for a single column:
   apple  rice
0      1     0
1      0     1
2      1     0
3      0     1
4      1     0
Expected results for all columns:
   Apple  Rice  Bread  Milk  Fridge
0      1     0      1     0       1
1      0     1      1     1       0
2      1     1      1     0       0
3      0     1      0     2       0
4      1     0      1     1       0
Is there any other way to get the same results?

You can stack and get dummies, then take the max by index (or the sum, if you want counts instead of dummies):
pd.get_dummies(df.stack()).max(level=0)
   Apple  Bread  Fridge  Milk  Rice
0      1      1       1     0     0
1      0      1       0     1     1
2      1      1       0     0     1
3      0      0       0     1     1
4      1      1       0     1     0
Alternatively, get_dummies on the entire DataFrame with blank prefixes and group along the columns axis.
pd.get_dummies(df, prefix='', prefix_sep='').max(level=0, axis=1)
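Note: recent pandas releases removed the level= argument to max/sum, so a minimal equivalent of the stacked variant, assuming pandas >= 2.0, is:
pd.get_dummies(df.stack()).groupby(level=0).max()  # dummies, as above
pd.get_dummies(df.stack()).groupby(level=0).sum()  # counts instead (Milk -> 2 in row 3)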

If you find creating a new column that combines all the individual columns to be an overhead, you can use generators instead, which also lets you handle large data.
Also, the recommended way of reading a sparse matrix into a pandas DataFrame is sparse.from_spmatrix. Read more here.
cv = CountVectorizer()
pd.DataFrame.sparse.from_spmatrix(
    cv.fit_transform(' '.join(x) for x in shopping_list),
    columns=cv.get_feature_names())
If you need to apply CountVectorizer to the DataFrame instead, use:
' '.join(x[1:]) for x in df.itertuples()
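Putting the two fragments together, a minimal self-contained sketch (assuming the df from the question; x[1:] skips the index field of each namedtuple, and newer scikit-learn spells the last call get_feature_names_out()):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
counts = pd.DataFrame.sparse.from_spmatrix(
    cv.fit_transform(' '.join(x[1:]) for x in df.itertuples()),
    columns=cv.get_feature_names())
print(counts)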

You can create a separate column by joining all the existing columns and apply CountVectorizer to it. Please refer to the sample code below:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

shopping_list = [
    ["Apple", "Bread", "Fridge"],
    ["Rice", "Bread", "Milk"],
    ["Apple", "Rice", "Bread"],
    ["Rice", "Milk", "Milk"],
    ["Red Chillies", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)
df.columns = ['col1', 'col2', 'col3']

# build a lowercase vocabulary from every value in the frame
vocab = set(df.values.flatten())
v = [i.lower() for i in vocab]
df['new'] = df.apply(' '.join, axis=1)
So your new dataframe will look like this
           col1   col2    col3                      new
0         Apple  Bread  Fridge       Apple Bread Fridge
1          Rice  Bread    Milk          Rice Bread Milk
2         Apple   Rice   Bread         Apple Rice Bread
3          Rice   Milk    Milk           Rice Milk Milk
4  Red Chillies  Bread    Milk  Red Chillies Bread Milk
Now you can apply CountVectorizer to the new column as shown below. The ngram_range=(1, 5) is needed so that a multi-word item such as "Red Chillies" is matched as a single n-gram:
CV = CountVectorizer(vocabulary=v, ngram_range=(1, 5))  # v, not vocab: the input is lowercased
cv_matrix = CV.fit_transform(df.new)
And you can get your desired dataframe using (the feature names come out lowercase because CountVectorizer lowercases its input by default):
pd.DataFrame(cv_matrix.toarray(), columns=CV.get_feature_names())
   bread  milk  fridge  rice  apple  red chillies
0      1     0       1     0      1             0
1      1     1       0     1      0             0
2      1     0       0     1      1             0
3      0     2       0     1      0             0
4      1     1       0     0      0             1

Related

Getting Dummy Back to Categorical

I have a df called X like this:
Index  Class  Family
1      Mid    12
2      Low    6
3      High   5
4      Low    2
I converted this to dummy variables using the code below:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

ohe = OneHotEncoder()
X_object = X.select_dtypes('object')
ohe.fit(X_object)
codes = ohe.transform(X_object).toarray()
feature_names = ohe.get_feature_names(['V1', 'V2'])

X = pd.concat([X.select_dtypes(exclude='object'),
               pd.DataFrame(codes, columns=feature_names).astype(int)], axis=1)
Resultant df is like:
   V1_Mid  V1_Low  V1_High  V2_12  V2_6  V2_5  V2_2
        1       0        0      1     0     0     0
...and so on
Question: How do I convert my resultant df back to the original df?
I have seen this but it gives me NameError: name 'Series' is not defined.
First we can regroup each original column from your resultant df into the original column names as the first level of a column multi-index:
>>> df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('_', 1).map(tuple))
>>> df = df.rename(columns={'V1': 'Class', 'V2': 'Family'}, level=0)
>>> df
  Class           Family
    Mid Low High      12  6  5  2
0     1   0    0       1  0  0  0
Now we see the second-level of columns are the values. Thus, within each top-level we want to get the column name that has a 1, knowing all the other entries are 0. This can be done with idxmax():
>>> orig_df = pd.concat({col: df[col].idxmax(axis='columns') for col in df.columns.levels[0]}, axis='columns')
>>> orig_df
Class Family
0 Mid 12
An even simpler way is to just stick to pandas.
df = pd.DataFrame({"Index":[1,2,3,4],"Class":["Mid","Low","High","Low"],"Family":[12,6,5,2]})
# Combine features in new column
df["combined"] = list(zip(df["Class"], df["Family"]))
print(df)
Out:
   Index Class  Family   combined
0      1   Mid      12  (Mid, 12)
1      2   Low       6   (Low, 6)
2      3  High       5  (High, 5)
3      4   Low       2   (Low, 2)
You can get the one hot encoding using pandas directly.
one_hot = pd.get_dummies(df["combined"])
print(one_hot)
Out:
   (High, 5)  (Low, 2)  (Low, 6)  (Mid, 12)
0          0         0         0          1
1          0         0         1          0
2          1         0         0          0
3          0         1         0          0
Then, to get back, you can just check the name of a column and select the rows in the original dataframe with the same tuple.
print(df[df["combined"] == one_hot.columns[0]])
Out:
   Index Class  Family   combined
2      3  High       5  (High, 5)
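To recover the combined column for every row at once, a small sketch using idxmax (this assumes one_hot from above, where each row contains exactly one 1):
# idxmax returns the column label of the 1 in each row,
# i.e. the original (Class, Family) tuple
recovered = one_hot.idxmax(axis=1)
print(recovered.tolist())
# [('Mid', 12), ('Low', 6), ('High', 5), ('Low', 2)]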

I have a DataFrame's columns and data in a list; I want to put the relevant data in the relevant column

Suppose you are given a list of all the items you can have, and separately a list of data whose shape is not fixed: each entry may contain any number of items. I want to create a dataframe from it, putting each value in the right column.
for example
columns = ['shirt', 'shoe', 'tie', 'hat']
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]
# and from this I want to create a dummy variable like this
   shirt  shoe  tie  hat
0      0     0    1    1
1      1     1    1    0
2      1     0    1    0
If you want indicator columns filled with 0s and 1s only, use MultiLabelBinarizer with DataFrame.reindex: the reindex changes the ordering of the columns to match the list and, if some value does not exist, adds a column of 0s.
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

columns = ['shirt', 'shoe', 'tie', 'hat']
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]

mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
        .reindex(columns, axis=1, fill_value=0))
print(df)
   shirt  shoe  tie  hat
0      0     0    1    1
1      1     1    1    0
2      1     0    1    0
Or Series.str.get_dummies:
df = (pd.Series(data).str.join('|')
        .str.get_dummies()
        .reindex(columns, axis=1, fill_value=0))
print(df)
   shirt  shoe  tie  hat
0      0     0    1    1
1      1     1    1    0
2      1     0    1    0
This is one approach using collections.Counter.
Ex:
from collections import Counter
import pandas as pd

columns = ['shirt', 'shoe', 'tie', 'hat']
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]

# Counter turns each row into {item: count}; missing items become NaN in the frame
data = map(Counter, data)
df = pd.DataFrame(data, columns=columns).fillna(0).astype(int)
print(df)
Output:
   shirt  shoe  tie  hat
0      0     0    1    1
1      1     1    1    0
2      1     0    1    0
You can try converting data to a dataframe:
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]
df = pd.DataFrame(data)
df
      0      1      2
0   hat    tie   None
1  shoe    tie  shirt
2   tie  shirt   None
Then use:
pd.get_dummies(df.stack()).groupby(level=0).agg('sum')
   hat  shirt  shoe  tie
0    1      0     0    1
1    0      1     1    1
2    0      1     0    1
Explanation:
df.stack() returns a MultiIndex Series:
0  0      hat
   1      tie
1  0     shoe
   1      tie
   2    shirt
2  0      tie
   1    shirt
dtype: object
If we get the dummy values of this series we get:
     hat  shirt  shoe  tie
0 0    1      0     0    0
  1    0      0     0    1
1 0    0      0     1    0
  1    0      0     0    1
  2    0      1     0    0
2 0    0      0     0    1
  1    0      1     0    0
Then you just have to group by the index and merge the rows using sum (we know each entry will only be a one or a zero after get_dummies):
df = pd.get_dummies(df.stack()).groupby(level=0).agg('sum')

Reverse the Multi label binarizer in pandas

I have pandas dataframe as
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# load sample data
df = pd.DataFrame({'user_id': ['1', '1', '2', '2', '2', '3'],
                   'fruits': ['banana', 'orange', 'orange', 'apple', 'banana', 'mango']})
I collect all the fruits for each user using below code -
# collect fruits for each user
transformed_df= df.groupby('user_id').agg({'fruits':lambda x: list(x)}).reset_index()
print(transformed_df)
  user_id                   fruits
0       1         [banana, orange]
1       2  [orange, apple, banana]
2       3                  [mango]
Once I get this list, I do multilabel-binarizer operation to convert this list into ones or zeroes
# perform MultiLabelBinarizer
final_df = transformed_df.join(
    pd.DataFrame(mlb.fit_transform(transformed_df.pop('fruits')),
                 columns=mlb.classes_,
                 index=transformed_df.index))
print(final_df)
  user_id  apple  banana  mango  orange
0       1      0       1      0       1
1       2      1       1      0       1
2       3      0       0      1       0
Now I have a requirement wherein the input dataframe given to me is final_df, and I need to get back transformed_df, which contains the list of fruits for each user.
How can I get transformed_df back, given that I have final_df as the input dataframe?
I am trying to get this working
# trying to get this working
inverse_df = final_df.join(pd.DataFrame(
    mlb.inverse_transform(final_df.loc[:, final_df.columns != 'user_id'].as_matrix())))
inverse_df
  user_id  apple  banana  mango  orange       0       1       2
0       1      0       1      0       1  banana  orange    None
1       2      1       1      0       1   apple  banana  orange
2       3      0       0      1       0   mango    None    None
But it doesn't give me the list back.
The inverse_transform() method should help. Here's the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer.inverse_transform
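A minimal sketch of how that could look, assuming final_df and the fitted mlb from above (.values replaces the removed .as_matrix(), and inverse_transform returns each row's labels in the order of mlb.classes_, not the original insertion order):
fruit_cols = [c for c in final_df.columns if c != 'user_id']
recovered = pd.DataFrame({
    'user_id': final_df['user_id'],
    'fruits': [list(t) for t in mlb.inverse_transform(final_df[fruit_cols].values)],
})
print(recovered)
#   user_id                   fruits
# 0       1         [banana, orange]
# 1       2  [apple, banana, orange]
# 2       3                  [mango]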

Splitting Column Lists in Pandas DataFrame

I'm looking for a good way to solve the following problem. My current fix is not particularly clean, and I'm hoping to learn from your insight.
Suppose I have a Pandas DataFrame whose entries look like this:
>>> df=pd.DataFrame(index=[1,2,3],columns=['Color','Texture','IsGlass'])
>>> df['Color']=[np.nan,['Red','Blue'],['Blue', 'Green', 'Purple']]
>>> df['Texture']=[['Rough'],np.nan,['Silky', 'Shiny', 'Fuzzy']]
>>> df['IsGlass']=[1,0,1]
>>> df
                         Color                      Texture  IsGlass
1                          NaN                    ['Rough']        1
2              ['Red', 'Blue']                          NaN        0
3  ['Blue', 'Green', 'Purple']  ['Silky', 'Shiny', 'Fuzzy']        1
So each observation in the index corresponds to something I measured about its color, texture, and whether it's glass or not. What I'd like to do is turn this into a new "indicator" DataFrame, by creating a column for each observed value, and changing the corresponding entry to a one if I observed it, and NaN if I have no information.
>>> df
   Red  Blue  Green  Purple  Rough  Silky  Shiny  Fuzzy  IsGlass
1  NaN   NaN    NaN     NaN      1    NaN    NaN    NaN        1
2    1     1    NaN     NaN    NaN    NaN    NaN    NaN        0
3  NaN     1      1       1    NaN      1      1      1        1
I have a solution that loops over each column, looks at its values, and, through a series of try/excepts for non-NaN values, splits the lists, creates a new column, etc., and concatenates.
This is my first post to StackOverflow; I hope this post conforms to the posting guidelines. Thanks.
Stacking Hacks!
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# stack drops the NaNs; unstack refills the holes with empty lists,
# which MultiLabelBinarizer can iterate over
df = df.stack().unstack(fill_value=[])

def b(c):
    d = mlb.fit_transform(c)
    return pd.DataFrame(d, c.index, mlb.classes_)

pd.concat([b(df[c]) for c in ['Color', 'Texture']], axis=1).join(df.IsGlass)
   Blue  Green  Purple  Red  Fuzzy  Rough  Shiny  Silky  IsGlass
1     0      0       0    0      0      1      0      0        1
2     1      0       0    1      0      0      0      0        0
3     1      1       1    0      1      0      1      1        1
I am just using pandas' get_dummies:
l = [pd.get_dummies(df[x].apply(pd.Series).stack(dropna=False)).sum(level=0)
     for x in ['Color', 'Texture']]
pd.concat(l, axis=1).assign(IsGlass=df.IsGlass)
Out[662]:
   Blue  Green  Purple  Red  Fuzzy  Rough  Shiny  Silky  IsGlass
1     0      0       0    0      0      1      0      0        1
2     1      0       0    1      0      0      0      0        0
3     1      1       1    0      1      0      1      1        1
For each texture/color in each row, I check whether the value is null. If not, I add that value as a column equal to 1 for that row.
import numpy as np
import pandas as pd

df = pd.DataFrame(index=[1, 2, 3], columns=['Color', 'Texture', 'IsGlass'])
df['Color'] = [np.nan, ['Red', 'Blue'], ['Blue', 'Green', 'Purple']]
df['Texture'] = [['Rough'], np.nan, ['Silky', 'Shiny', 'Fuzzy']]
df['IsGlass'] = [1, 0, 1]

for row in df.itertuples():
    # np.all(pd.isnull(...)) is True only for the NaN cells, not for lists
    if not np.all(pd.isnull(row.Color)):
        for val in row.Color:
            df.loc[row.Index, val] = 1
    if not np.all(pd.isnull(row.Texture)):
        for val in row.Texture:
            df.loc[row.Index, val] = 1

Create dummies from a column for a subset of data which doesn't contain all the category values of that column

I am handling a subset of a large data set.
There is a column named "type" in the dataframe, and "type" is expected to have values like [1,2,3,4].
In a certain subset, I find the "type" column only contains certain values, like [1,4]:
In [1]: df
Out[2]:
   type
0     1
1     4
When I create dummies from column "type" on that subset, it turns out like this:
In [3]: import pandas as pd
In [4]: pd.get_dummies(df["type"], prefix="type")
Out[5]:
   type_1  type_4
0       1       0
1       0       1
It doesn't have the columns named "type_2" and "type_3". What I want is:
Out[6]:
   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1
Is there a solution for this?
What you need to do is make the column 'type' into a pd.Categorical and specify the categories
pd.get_dummies(pd.Categorical(df.type, [1, 2, 3, 4]), prefix='type')
   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1
Another solution with reindex_axis and add_prefix:
df1 = pd.get_dummies(df["type"])
.reindex_axis([1,2,3,4], axis=1, fill_value=0)
.add_prefix('type')
print (df1)
type1 type2 type3 type4
0 1 0 0 0
1 0 0 0 1
Or categorical solution:
df1 = pd.get_dummies(df["type"].astype('category', categories=[1, 2, 3, 4]), prefix='type')
print (df1)
type_1 type_2 type_3 type_4
0 1 0 0 0
1 0 0 0 1
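On newer pandas the categories= argument to astype was removed, so a hedged equivalent sketch uses CategoricalDtype:
from pandas.api.types import CategoricalDtype

# the dtype carries the full category list, so get_dummies emits a column
# for every category even when some are absent from the subset
df1 = pd.get_dummies(df["type"].astype(CategoricalDtype([1, 2, 3, 4])), prefix='type')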
Since you tagged your post as one-hot-encoding, you may find the sklearn module's OneHotEncoder useful, in addition to pure Pandas solutions:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# sample data
df = pd.DataFrame({'type': [1, 4]})
n_vals = 5

# one-hot encoding
encoder = OneHotEncoder(n_values=n_vals, sparse=False, dtype=int)
data = encoder.fit_transform(df.type.values.reshape(-1, 1))

# encoded data frame
newdf = pd.DataFrame(data, columns=['type_{}'.format(x) for x in range(n_vals)])
print(newdf)
   type_0  type_1  type_2  type_3  type_4
0       0       1       0       0       0
1       0       0       0       0       1
One advantage of using this approach is that OneHotEncoder easily produces sparse vectors for very large class sets. (Just change to sparse=True in the OneHotEncoder() declaration.)
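For reference, a hedged sketch of the same encoding on newer scikit-learn, where n_values was replaced by categories and the sparse flag became sparse_output:
from sklearn.preprocessing import OneHotEncoder

# categories lists the allowed values per input column; sparse_output=True
# returns a SciPy CSR matrix instead of a dense array
encoder = OneHotEncoder(categories=[list(range(n_vals))], sparse_output=True, dtype=int)
sparse_data = encoder.fit_transform(df.type.values.reshape(-1, 1))
print(sparse_data.toarray())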
