Reverse the MultiLabelBinarizer in pandas

I have a pandas DataFrame:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# load sample data
df = pd.DataFrame( {'user_id':['1','1','2','2','2','3'], 'fruits':['banana','orange','orange','apple','banana','mango']})
I collect all the fruits for each user using the code below -
# collect fruits for each user
transformed_df= df.groupby('user_id').agg({'fruits':lambda x: list(x)}).reset_index()
print(transformed_df)
user_id fruits
0 1 [banana, orange]
1 2 [orange, apple, banana]
2 3 [mango]
Once I get this list, I apply MultiLabelBinarizer to convert the lists into ones and zeroes:
# perform MultiLabelBinarizer
final_df = transformed_df.join(pd.DataFrame(mlb.fit_transform(transformed_df.pop('fruits')),columns=mlb.classes_,index=transformed_df.index))
print(final_df)
user_id apple banana mango orange
0 1 0 1 0 1
1 2 1 1 0 1
2 3 0 0 1 0
Now I have a requirement wherein the input dataframe given to me is final_df, and I need to get back transformed_df, which contains the list of fruits for each user.
How can I get transformed_df back, given that I have final_df as the input dataframe?
I am trying to get this working
# Trying to get this working
inverse_df = final_df.join(pd.DataFrame(mlb.inverse_transform(final_df.loc[:, final_df.columns != 'user_id'].values)))  # .values replaces the removed as_matrix()
inverse_df
  user_id  apple  banana  mango  orange       0       1       2
0       1      0       1      0       1  banana  orange    None
1       2      1       1      0       1   apple  banana  orange
2       3      0       0      1       0   mango    None    None
But it doesn't give me the lists back in a single column.

The inverse_transform() method should help. Here's the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer.inverse_transform
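A minimal sketch of the round trip, assuming the same fitted mlb is still in scope (inverse_transform needs the indicator columns in the same order as mlb.classes_, which is how final_df was built):
# Select the indicator columns, invert them, and rebuild the list column.
# inverse_transform returns tuples, so each one is converted to a list.
fruit_cols = [c for c in final_df.columns if c != 'user_id']
inverse_df = pd.DataFrame({
    'user_id': final_df['user_id'],
    'fruits': [list(t) for t in mlb.inverse_transform(final_df[fruit_cols].values)]
})
print(inverse_df)
Note the recovered lists come back in class (alphabetical) order, e.g. [apple, banana, orange] for user 2, not necessarily in the original insertion order.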

Calculation using shifting is not working in a for loop

The problem consists of calculating the column "accumulated" in a dataframe from the columns "accumulated" and "weekly". The formula is: accumulated[t] = weekly[t] + accumulated[t-1]
The desired result should be:
weekly accumulated
2 0
1 1
4 5
2 7
The result I'm obtaining is:
weekly accumulated
2 0
1 1
4 4
2 2
What I have tried is:
for key, value in df_dic.items():
    df_aux = df_dic[key]
    df_aux['accumulated'] = 0
    df_aux['accumulated'] = (df_aux.weekly + df_aux.accumulated.shift(1))
    #df_aux["accumulated"] = df_aux.iloc[:,2] + df_aux.iloc[:,3].shift(1)
    df_aux.iloc[0,3] = 0  # I put this because I want to force the first cell to be 0.
Here df_aux.iloc[0,3] is the first row of the column "accumulated".
What am I doing wrong?
Thank you
EDIT: df_dic is a dictionary with 5 dataframes, e.g. {0: df1, 1: df2, 2: df3}. All the dataframes have the same size and the same column names, so I run the for loop to do the same calculation on every dataframe inside the dictionary.
EDIT2: I'm trying to do the computation outside the for loop and it is still not working.
What I'm doing is:
df_auxp = df_dic[0]
df_auxp['accumulated'] = 0
df_auxp['accumulated'] = df_auxp["weekly"] + df_auxp["accumulated"].shift(1)
df_auxp.iloc[0,3] = df_auxp.iloc[0,3].fillna(0)
Maybe it has something to do with the dictionary interaction...
To solve for 3 dataframes
import pandas as pd
df1 = pd.DataFrame({'weekly':[2,1,4,2]})
df2 = pd.DataFrame({'weekly':[3,2,5,3]})
df3 = pd.DataFrame({'weekly':[4,3,6,4]})
print (df1)
print (df2)
print (df3)
for d in [df1, df2, df3]:
    d['accumulated'] = d['weekly'].cumsum() - d.iloc[0,0]
    print (d)
The output of this will be as follows:
Original dataframes:
df1
weekly
0 2
1 1
2 4
3 2
df2
weekly
0 3
1 2
2 5
3 3
df3
weekly
0 4
1 3
2 6
3 4
Updated dataframes:
df1:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
df2:
weekly accumulated
0 3 0
1 2 2
2 5 7
3 3 10
df3:
weekly accumulated
0 4 0
1 3 3
2 6 9
3 4 13
To solve for 1 dataframe
You need to use cumsum and then subtract the value of the first row. That will give you the desired result. Here's how to do it:
import pandas as pd
df = pd.DataFrame({'weekly':[2,1,4,2]})
print (df)
df['accumulated'] = df['weekly'].cumsum() - df.iloc[0,0]
print (df)
Original dataframe:
weekly
0 2
1 1
2 4
3 2
Updated dataframe:
weekly accumulated
0 2 0
1 1 1
2 4 5
3 2 7
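As an aside on why the shift-based assignment in the question fails: pandas evaluates the right-hand side once against the column's existing values, not row by row, so the recurrence accumulated[t] = weekly[t] + accumulated[t-1] never sees the freshly computed values. cumsum is the vectorized form of that recurrence; an explicit loop (a sketch on the question's sample data) makes the equivalence visible:
import pandas as pd
df = pd.DataFrame({'weekly': [2, 1, 4, 2]})
# Row-by-row version of accumulated[t] = weekly[t] + accumulated[t-1],
# with the first cell forced to 0 as in the question.
acc = [0]
for w in df['weekly'].iloc[1:]:
    acc.append(w + acc[-1])
df['accumulated'] = acc
print(df)   # accumulated: 0, 1, 5, 7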

count items across all columns using pandas method

I have this dataframe and I can get the count of each item per row using CountVectorizer. But this works correctly only for a single column (e.g. col1). How do I apply it to the entire dataframe (all 3 columns)?
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
shopping_list = [
["Apple", "Bread", "Fridge"],
["Rice", "Bread", "Milk"],
["Apple", "Rice", "Bread"],
["Rice", "Milk", "Milk"],
["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)
df.columns = ['col1', 'col2', 'col3']
CV = CountVectorizer()
cv_matrix=CV.fit_transform(df['col1'].values)
ndf = pd.SparseDataFrame(cv_matrix)
ndf.columns = CV.get_feature_names()
X = ndf.fillna("0")
The results are correct for a single column:
apple rice
0 1 0
1 0 1
2 1 0
3 0 1
4 1 0
Expected Results for all columns:
Apple Rice Bread Milk Fridge
0 1 0 1 0 1
1 0 1 1 1 0
2 1 1 1 0 0
3 0 1 0 2 0
4 1 0 1 1 0
Is there any other way to get the same results?
You can stack and get dummies. Then take the max by index (sum if you want counts instead of dummies)
pd.get_dummies(df.stack()).max(level=0)
Apple Bread Fridge Milk Rice
0 1 1 1 0 0
1 0 1 0 1 1
2 1 1 0 0 1
3 0 0 0 1 1
4 1 1 0 1 0
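Note that recent pandas versions removed the level argument to max/sum; the equivalent on those versions is an explicit groupby:
pd.get_dummies(df.stack()).groupby(level=0).max()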
Alternatively, get_dummies on the entire DataFrame with blank prefixes and group along the columns axis.
pd.get_dummies(df, prefix='', prefix_sep='').max(level=0, axis=1)
If you find creating a new column that combines all the individual columns to be an overhead, you can use a generator instead, which also scales to large data.
Also, the recommended way of loading a sparse matrix into a pandas dataframe is pd.DataFrame.sparse.from_spmatrix (the pd.SparseDataFrame used in the question is deprecated).
cv = CountVectorizer()
pd.DataFrame.sparse.from_spmatrix(
    cv.fit_transform(' '.join(x) for x in shopping_list),
    columns=cv.get_feature_names())
If you need to apply CountVectorizer to the DataFrame instead, use:
' '.join(x[1:]) for x in df.itertuples()
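Putting those pieces together, a minimal sketch (assuming a recent scikit-learn; on older versions use get_feature_names() instead of get_feature_names_out()):
cv = CountVectorizer()
counts = pd.DataFrame.sparse.from_spmatrix(
    cv.fit_transform(' '.join(x[1:]) for x in df.itertuples()),
    columns=cv.get_feature_names_out())
print(counts)
The generator feeds one space-joined row at a time to the vectorizer, so no intermediate joined column is ever materialized.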
You can create a separate column by joining all the existing columns and apply CountVectorizer to it. Please refer to the sample code below:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
shopping_list = [
["Apple", "Bread", "Fridge"],
["Rice", "Bread", "Milk"],
["Apple", "Rice", "Bread"],
["Rice", "Milk", "Milk"],
["Red Chillies", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)
df.columns = ['col1', 'col2', 'col3']
vocab = set(df.values.flatten())
v = [i.lower() for i in vocab]
df['new'] = df.apply(' '.join, axis=1)
So your new dataframe will look like this
col1 col2 col3 new
0 Apple Bread Fridge Apple Bread Fridge
1 Rice Bread Milk Rice Bread Milk
2 Apple Rice Bread Apple Rice Bread
3 Rice Milk Milk Rice Milk Milk
4 Red Chillies Bread Milk Red Chillies Bread Milk
Now you can apply CountVectorizer on the new column as shown below:
CV = CountVectorizer(vocabulary=v, ngram_range=(1,5))  # v is the lowercased vocab built above; ngram_range up to 5 lets the multi-word item 'red chillies' match
cv_matrix=CV.fit_transform(df.new)
And you can get your desired dataframe using:
pd.DataFrame(cv_matrix.toarray(), columns= CV.get_feature_names())
bread milk fridge rice apple red chillies
0 1 0 1 0 1 0
1 1 1 0 1 0 0
2 1 0 0 1 1 0
3 0 2 0 1 0 0
4 1 1 0 0 0 1

I have a DataFrame's columns and data in lists; I want to put the relevant data into the relevant column

Suppose you are given a list of all the items you can have, and separately a list of data whose shape is not fixed: each entry may contain any number of items. You want to create a dataframe from this, putting every value into the right column.
for example
columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]
# and from this I want to create dummy variables like this
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
If you want indicator columns filled with 0 and 1 only, use MultiLabelBinarizer with DataFrame.reindex. The reindex call changes the ordering of the columns to match the list and, if some value does not exist, adds an all-zero column:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]

mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
        .reindex(columns, axis=1, fill_value=0))
print (df)
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
Or Series.str.get_dummies:
df = pd.Series(data).str.join('|').str.get_dummies().reindex(columns, axis=1, fill_value=0)
print (df)
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
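For clarity on the second variant: str.join('|') first collapses each list cell into a single delimited string, which is exactly the format str.get_dummies splits on:
print(pd.Series(data).str.join('|'))
# 0           hat|tie
# 1    shoe|tie|shirt
# 2         tie|shirt
# dtype: object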
This is one approach using collections.Counter.
Ex:
import pandas as pd
from collections import Counter

columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]

data = map(Counter, data)   # one Counter per row; the keys become columns
#df = pd.DataFrame(data, columns=columns)
df = pd.DataFrame(data, columns=columns).fillna(0).astype(int)
print(df)
Output:
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
You can try converting data to a dataframe:
data = [['hat','tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]
df = pd.DataFrame(data)
df
0 1 2
0 hat tie None
1 shoe tie shirt
2 tie shirt None
Then use:
pd.get_dummies(df.stack()).groupby(level=0).agg('sum')
hat shirt shoe tie
0 1 0 0 1
1 0 1 1 1
2 0 1 0 1
Explanation:
df.stack() returns a MultiIndex Series:
0  0       hat
   1       tie
1  0      shoe
   1       tie
   2     shirt
2  0       tie
   1     shirt
dtype: object
If we get the dummy values of this series we get:
     hat  shirt  shoe  tie
0 0    1      0     0    0
  1    0      0     0    1
1 0    0      0     1    0
  1    0      0     0    1
  2    0      1     0    0
2 0    0      0     0    1
  1    0      1     0    0
Then you just have to group by the index and merge the rows using sum (because we know there will only be ones and zeros after get_dummies):
df = pd.get_dummies(df.stack()).groupby(level=0).agg('sum')

pandas pd.read_html heading shifted to the right

I'm trying to convert a wiki page table to a dataframe. The headings are shifted to the right: 'Launches' should be where 'Successes' is now.
I have used the skiprows option, but it did not work.
df = pd.read_html(r'https://en.wikipedia.org/wiki/2018_in_spaceflight',skiprows=[1,2])[7]
df2 = df[df.columns[1:5]]
1 2 3 4
0 Launches Successes Failures Partial failures
1 India 1 1 0
2 Japan 3 3 0
3 New Zealand 1 1 0
4 Russia 3 3 0
5 United States 8 8 0
6 24 23 0 1
The problem is that there are merged cells in the first column of the original table. If you want to parse it exactly, you should write a parser. Provisionally, you can try:
df = pd.read_html(r'https://en.wikipedia.org/wiki/2018_in_spaceflight', header=0)[7]
df.columns = [""] + list(df.columns[:-1])
df.iloc[-1] = [""] + list(df.iloc[-1][:-1])
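A hedged toy illustration of the same realignment trick, independent of the live wiki page (the real table may have changed since the question was asked). The merged first column makes the header row one cell short, so every name lands one column too far left; prepending a blank name shifts them back:
import pandas as pd
# Toy frame mimicking the mis-parsed result: the country labels sit under
# the column that pandas labelled 'Launches', and every header is one off.
raw = pd.DataFrame([['India', 1, 1, 0],
                    ['Japan', 3, 3, 0]],
                   columns=['Launches', 'Successes', 'Failures', 'Partial failures'])
raw.columns = [''] + list(raw.columns[:-1])
print(raw)   # the numeric columns now line up with their headings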

Splitting Column Lists in Pandas DataFrame

I'm looking for a good way to solve the following problem. My current fix is not particularly clean, and I'm hoping to learn from your insight.
Suppose I have a pandas DataFrame whose entries look like this:
>>> df=pd.DataFrame(index=[1,2,3],columns=['Color','Texture','IsGlass'])
>>> df['Color']=[np.nan,['Red','Blue'],['Blue', 'Green', 'Purple']]
>>> df['Texture']=[['Rough'],np.nan,['Silky', 'Shiny', 'Fuzzy']]
>>> df['IsGlass']=[1,0,1]
>>> df
Color Texture IsGlass
1 NaN ['Rough'] 1
2 ['Red', 'Blue'] NaN 0
3 ['Blue', 'Green', 'Purple'] ['Silky','Shiny','Fuzzy'] 1
So each observation in the index corresponds to something I measured about its color, texture, and whether it's glass or not. What I'd like to do is turn this into a new "indicator" DataFrame, by creating a column for each observed value, and changing the corresponding entry to a one if I observed it, and NaN if I have no information.
>>> df
   Red  Blue  Green  Purple  Rough  Silky  Shiny  Fuzzy  IsGlass
1  NaN   NaN    NaN     NaN      1    NaN    NaN    NaN        1
2    1     1    NaN     NaN    NaN    NaN    NaN    NaN        0
3  NaN     1      1       1    NaN      1      1      1        1
I have a solution that loops over each column, looks at its values, and, through a series of try/excepts for non-NaN values, splits the lists, creates new columns, and concatenates.
This is my first post to StackOverflow - I hope this post conforms to the posting guidelines. Thanks.
Stacking Hacks!
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()

# stack() drops the NaN cells; unstack(fill_value=[]) restores the shape
# with empty lists, so MultiLabelBinarizer can consume every cell.
df = df.stack().unstack(fill_value=[])

def b(c):
    d = mlb.fit_transform(c)
    return pd.DataFrame(d, c.index, mlb.classes_)

pd.concat([b(df[c]) for c in ['Color', 'Texture']], axis=1).join(df.IsGlass)
Blue Green Purple Red Fuzzy Rough Shiny Silky IsGlass
1 0 0 0 0 0 1 0 0 1
2 1 0 0 1 0 0 0 0 0
3 1 1 1 0 1 0 1 1 1
I am just using pandas get_dummies:
l=[pd.get_dummies(df[x].apply(pd.Series).stack(dropna=False)).sum(level=0) for x in ['Color','Texture']]
pd.concat(l,axis=1).assign(IsGlass=df.IsGlass)
Out[662]:
Blue Green Purple Red Fuzzy Rough Shiny Silky IsGlass
1 0 0 0 0 0 1 0 0 1
2 1 0 0 1 0 0 0 0 0
3 1 1 1 0 1 0 1 1 1
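The same pandas deprecation note applies here: on recent versions sum(level=0) was removed, so the list comprehension becomes:
l = [pd.get_dummies(df[x].apply(pd.Series).stack(dropna=False)).groupby(level=0).sum()
     for x in ['Color','Texture']]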
For each texture/color in each row, I check whether the value is null. If not, we set that value's column to 1 for that row.
import numpy as np
import pandas as pd
df=pd.DataFrame(index=[1,2,3],columns=['Color','Texture','IsGlass'])
df['Color']=[np.nan,['Red','Blue'],['Blue', 'Green', 'Purple']]
df['Texture']=[['Rough'],np.nan,['Silky', 'Shiny', 'Fuzzy']]
df['IsGlass']=[1,0,1]
for row in df.itertuples():
    if not np.all(pd.isnull(row.Color)):
        for val in row.Color:
            df.loc[row.Index, val] = 1
    if not np.all(pd.isnull(row.Texture)):
        for val in row.Texture:
            df.loc[row.Index, val] = 1
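A note on this approach: df.loc[row.Index, val] = 1 creates each new column on first assignment with NaN everywhere else, so unobserved entries stay NaN, which matches the desired output above (unlike the 0-filled results of the other answers). For example:
# Columns created by the loop default to NaN where never assigned.
print(df[['Red', 'Blue', 'Green', 'Purple', 'Rough', 'Silky', 'Shiny', 'Fuzzy', 'IsGlass']])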
