Splitting Column Lists in Pandas DataFrame - python-3.x

I'm looking for a good way to solve the following problem. My current fix is not particularly clean, and I'm hoping to learn from your insight.
Suppose I have a pandas DataFrame whose entries look like this:
>>> df=pd.DataFrame(index=[1,2,3],columns=['Color','Texture','IsGlass'])
>>> df['Color']=[np.nan,['Red','Blue'],['Blue', 'Green', 'Purple']]
>>> df['Texture']=[['Rough'],np.nan,['Silky', 'Shiny', 'Fuzzy']]
>>> df['IsGlass']=[1,0,1]
>>> df
                         Color                      Texture  IsGlass
1                          NaN                    ['Rough']        1
2              ['Red', 'Blue']                          NaN        0
3  ['Blue', 'Green', 'Purple']  ['Silky', 'Shiny', 'Fuzzy']        1
So each observation in the index corresponds to something I measured about its color, texture, and whether it's glass or not. What I'd like to do is turn this into a new "indicator" DataFrame, by creating a column for each observed value, and changing the corresponding entry to a one if I observed it, and NaN if I have no information.
>>> df
   Red  Blue  Green  Purple  Rough  Silky  Shiny  Fuzzy  IsGlass
1  NaN   NaN    NaN     NaN      1    NaN    NaN    NaN        1
2    1     1    NaN     NaN    NaN    NaN    NaN    NaN        0
3  NaN     1      1       1    NaN      1      1      1        1
I have a solution that loops over each column, inspects its values, and, through a series of try/excepts for non-NaN values, splits the lists, creates new columns, and concatenates.
This is my first post to StackOverflow - I hope this post conforms to the posting guidelines. Thanks.

Stacking Hacks!
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
df = df.stack().unstack(fill_value=[])

def b(c):
    d = mlb.fit_transform(c)
    return pd.DataFrame(d, c.index, mlb.classes_)

pd.concat([b(df[c]) for c in ['Color', 'Texture']], axis=1).join(df.IsGlass)
   Blue  Green  Purple  Red  Fuzzy  Rough  Shiny  Silky  IsGlass
1     0      0       0    0      0      1      0      0        1
2     1      0       0    1      0      0      0      0        0
3     1      1       1    0      1      0      1      1        1
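The question asked for NaN rather than 0 where a value was not observed. If that matters, the indicator columns can be masked afterwards; a minimal sketch, where `out` is an assumed stand-in for a small indicator frame like the one above:

```python
import numpy as np
import pandas as pd

# A small indicator frame like the one produced above (0 = not observed)
out = pd.DataFrame({'Blue': [0, 1, 1], 'Rough': [1, 0, 0], 'IsGlass': [1, 0, 1]},
                   index=[1, 2, 3])

# Turn 0 into NaN in the indicator columns only, leaving IsGlass intact
indicator_cols = ['Blue', 'Rough']
out[indicator_cols] = out[indicator_cols].where(out[indicator_cols].ne(0))
print(out)
```

`where` keeps values where the condition holds and fills NaN elsewhere, so the ones survive and the zeros become NaN.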

I am just using pandas' get_dummies (with the grouped sum written as groupby(level=0).sum(), since the sum(level=0) shorthand was later deprecated):
l = [pd.get_dummies(df[x].apply(pd.Series).stack(dropna=False)).groupby(level=0).sum()
     for x in ['Color', 'Texture']]
pd.concat(l, axis=1).assign(IsGlass=df.IsGlass)
Out[662]:
   Blue  Green  Purple  Red  Fuzzy  Rough  Shiny  Silky  IsGlass
1     0      0       0    0      0      1      0      0        1
2     1      0       0    1      0      0      0      0        0
3     1      1       1    0      1      0      1      1        1
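On pandas 0.25+, `Series.explode` offers a similar route without `apply(pd.Series)`. A sketch rebuilding the sample frame from the question, reindexing so rows with no observations come back as 0:

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question
df = pd.DataFrame(index=[1, 2, 3], columns=['Color', 'Texture', 'IsGlass'])
df['Color'] = [np.nan, ['Red', 'Blue'], ['Blue', 'Green', 'Purple']]
df['Texture'] = [['Rough'], np.nan, ['Silky', 'Shiny', 'Fuzzy']]
df['IsGlass'] = [1, 0, 1]

# Explode each list column to one value per row, one-hot encode,
# then collapse back to one row per original index
parts = [pd.get_dummies(df[c].explode()).groupby(level=0).sum()
         .reindex(df.index, fill_value=0)
         for c in ['Color', 'Texture']]
result = pd.concat(parts, axis=1).assign(IsGlass=df.IsGlass)
print(result)
```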

For each texture/color in each row, I check whether the value is null. If not, I add that value as a column equal to 1 for that row.
import numpy as np
import pandas as pd
df=pd.DataFrame(index=[1,2,3],columns=['Color','Texture','IsGlass'])
df['Color']=[np.nan,['Red','Blue'],['Blue', 'Green', 'Purple']]
df['Texture']=[['Rough'],np.nan,['Silky', 'Shiny', 'Fuzzy']]
df['IsGlass']=[1,0,1]
for row in df.itertuples():
    if not np.all(pd.isnull(row.Color)):
        for val in row.Color:
            df.loc[row.Index, val] = 1
    if not np.all(pd.isnull(row.Texture)):
        for val in row.Texture:
            df.loc[row.Index, val] = 1

Related

How to add value to specific index that is out of bounds

I have a list array
list = [[0, 1, 2, 3, 4, 5],[0],[1],[2],[3],[4],[5]]
Say I add [6, 7, 8] to the first row as the header for my three new columns, what's the best way to add values in these new columns, without getting index out of bounds? I've tried first filling all three columns with "" but when I add a value, it then pushes the "" out to the right and increases my list size.
Would it be any easier to use a Pandas dataframe? Are you allowed "gaps" in a Pandas dataframe?
According to the OP's comment, I think a pandas DataFrame is the more appropriate solution. You cannot have 'gaps', but you can have NaN values, like this:
import numpy as np
import pandas as pd

# create sample data
a = np.arange(1, 6)
df = pd.DataFrame(zip(*[a]*5))
print(df)
output:
0 1 2 3 4
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
3 4 4 4 4 4
4 5 5 5 5 5
for adding empty columns:
# add new columns, not empty but filled with NaN
df[5] = df[6] = df[7] = float('nan')
# fill a single value in column 7 at row index 4
df.loc[4, 7] = 123
print(df)
output:
0 1 2 3 4 5 6 7
0 1 1 1 1 1 NaN NaN NaN
1 2 2 2 2 2 NaN NaN NaN
2 3 3 3 3 3 NaN NaN NaN
3 4 4 4 4 4 NaN NaN NaN
4 5 5 5 5 5 NaN NaN 123.0

Multiplying 2 pandas dataframes generates nan

I have 2 dataframes as below
import pandas as pd
dat = pd.DataFrame({'val1' : [1,2,1,2,4], 'val2' : [1,2,1,2,4]})
dat1 = pd.DataFrame({'val3' : [1,2,1,2,4]})
Now I want to multiply each column of dat by dat1, so I did the below:
dat * dat1
However, this generates NaN values for all elements.
Could you please advise on the correct approach? I could run a for loop over each column of dat, but I wonder whether there is a better method.
Thanks for your pointers.
When doing multiplication (or any arithmetic operation), pandas does index alignment. This goes for both the index and the columns in the case of dataframes. Where labels match, it multiplies; elsewhere it puts NaN, and the result has the union of the indices and columns of the operands.
So, to "avoid" this alignment, make dat1 a label-unaware data structure, e.g., a NumPy array:
In [116]: dat * dat1.to_numpy()
Out[116]:
val1 val2
0 1 1
1 4 4
2 1 1
3 4 4
4 16 16
To see what's "really" being multiplied, you can align yourself:
In [117]: dat.align(dat1)
Out[117]:
( val1 val2 val3
0 1 1 NaN
1 2 2 NaN
2 1 1 NaN
3 2 2 NaN
4 4 4 NaN,
val1 val2 val3
0 NaN NaN 1
1 NaN NaN 2
2 NaN NaN 1
3 NaN NaN 2
4 NaN NaN 4)
(Extra: dat and dat1 happen to have the same index here; change the index of one of them and align again to see the union behaviour.)
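To see that union behaviour concretely, here is a small sketch with two Series that share only part of their index:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=[0, 1, 2])
b = pd.Series([10, 20, 30], index=[1, 2, 3])

# Only labels 1 and 2 exist in both operands; labels 0 and 3 get NaN
result = a * b
print(result)
```

The result is indexed by the union [0, 1, 2, 3], with products only where both labels aligned.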
You need to change two things:
use mul with axis=0
use a Series instead of dat1 (otherwise the multiplication will try to align the columns, and there are no common ones between your two dataframes)
out = dat.mul(dat1['val3'], axis=0)
output:
val1 val2
0 1 1
1 4 4
2 1 1
3 4 4
4 16 16

Adding NaN changes dtype of column in Pandas dataframe

I have an int dataframe:
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
3 9 10 11
But if I set a value to NaN, the whole column is cast to floats! Apparently int columns can't have NaN values. But why is that?
>>> df.iloc[2,1] = np.nan
>>> df
0 1 2
0 0 1.0 2
1 3 4.0 5
2 6 NaN 8
3 9 10.0 11
For performance reasons (which make a big impact in this case), Pandas wants your columns to be from the same type, and thus will do its best to keep it that way. NaN is a float value, and all your integers can be harmlessly converted to floats, so that's what happens.
If it can't convert harmlessly, the column gets whatever type is needed to make it work:
>>> x = pd.DataFrame(np.arange(4).reshape(2,2))
>>> x
0 1
0 0 1
1 2 3
>>> x[1].dtype
dtype('int64')
>>> x.iloc[1, 1] = 'string'
>>> x
0 1
0 0 1
1 2 string
>>> x[1].dtype
dtype('O')
Since 1 can't be converted to a string in a reasonable manner (without guessing what the user wants), the type is converted to object, which is general and doesn't allow for any optimizations. This does give you what is needed for a multi-type column, though:
>>> x[1] = x[1].astype('O') # Alternatively use a non-float NaN object
>>> x.iloc[1, 1] = np.nan # or float('nan')
>>> x
0 1
0 0 1
1 2 NaN
This is usually not recommended, though, if you can avoid it.
Not best, but visually better, is to use pd.NA rather than np.nan:
>>> df.iloc[2,1] = pd.NA
>>> df
0 1 2
0 0 1 2
1 3 4 5
2 6 <NA> 8
3 9 10 11
Seems to be good but:
>>> df.dtypes
0 int64
1 object # <- not float, but object
2 int64
dtype: object
You can read this page from the documentation.
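If keeping an integer dtype matters, newer pandas (0.24+) also offers the nullable `Int64` extension dtype, which stores integers alongside `pd.NA`; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))

# Convert to the nullable integer dtype (note the capital "I")
df = df.astype('Int64')
df.iloc[2, 1] = pd.NA

# The column keeps an integer dtype instead of falling back to float or object
print(df.dtypes)
```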

Reverse the Multi label binarizer in pandas

I have pandas dataframe as
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# load sample data
df = pd.DataFrame( {'user_id':['1','1','2','2','2','3'], 'fruits':['banana','orange','orange','apple','banana','mango']})
I collect all the fruits for each user using the code below:
# collect fruits for each user
transformed_df= df.groupby('user_id').agg({'fruits':lambda x: list(x)}).reset_index()
print(transformed_df)
user_id fruits
0 1 [banana, orange]
1 2 [orange, apple, banana]
2 3 [mango]
Once I get this list, I apply the MultiLabelBinarizer to convert each list into ones and zeroes:
# perform MultiLabelBinarizer
final_df = transformed_df.join(pd.DataFrame(mlb.fit_transform(transformed_df.pop('fruits')),columns=mlb.classes_,index=transformed_df.index))
print(final_df)
user_id apple banana mango orange
0 1 0 1 0 1
1 2 1 1 0 1
2 3 0 0 1 0
Now, I have a requirement wherein, the input dataframe given to me is final_df and I need to get back the transformed_df which contains the list of fruits for each user.
How can I get this transformed_df back , given that I have final_df as input dataframe?
I am trying to get this working
# Trying to get this working
inverse_df = final_df.join(pd.DataFrame(mlb.inverse_transform(final_df.loc[:, final_df.columns != 'user_id'].to_numpy())))
inverse_df
  user_id  apple  banana  mango  orange       0       1       2
0       1      0       1      0       1  banana  orange    None
1       2      1       1      0       1   apple  banana  orange
2       3      0       0      1       0   mango    None    None
But it doesn't give me the lists back.
The inverse_transform() method should help. Here's the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MultiLabelBinarizer.html#sklearn.preprocessing.MultiLabelBinarizer.inverse_transform
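A sketch of how that could look here, rebuilding `final_df` from the question; the `fruit_cols` helper and the fit on the known classes are assumptions for illustration, and `to_numpy()` stands in for the since-removed `as_matrix()`:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Rebuild final_df from the question
final_df = pd.DataFrame({
    'user_id': ['1', '2', '3'],
    'apple':   [0, 1, 0],
    'banana':  [1, 1, 0],
    'mango':   [0, 0, 1],
    'orange':  [1, 1, 0],
})

# Assumption: every column except user_id is a binarized fruit column
fruit_cols = [c for c in final_df.columns if c != 'user_id']

# Fit a binarizer on the known classes so inverse_transform can map columns back
mlb = MultiLabelBinarizer(classes=fruit_cols)
mlb.fit([fruit_cols])

restored = pd.DataFrame({
    'user_id': final_df['user_id'],
    'fruits': [list(t) for t in mlb.inverse_transform(final_df[fruit_cols].to_numpy())],
})
print(restored)
```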

Append Two Dataframes Together (Pandas, Python3)

I am trying to append/join(?) two different dataframes together that don't share any overlapping data.
DF1 looks like
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
....
Brown 6
and DF2 looks like
Area Miles
2 3
1 2
....
7 12
I am trying to append these together using
bigdata = df1.append(df2,ignore_index = True).reset_index()
but I get this
Teams Points
Red 2
Green 1
Orange 3
Yellow 4
Area Miles
2 3
1 2
How do I get something like this?
Teams Points Area Miles
Red 2 2 3
Green 1 1 2
Orange 3
Yellow 4
EDIT: In regard to EdChum's answer, I have tried merge and join, but each creates a somewhat strange table. Instead of what I am looking for (as listed above), it returns something like this:
Teams Points Area Miles
Red 2 2 3
Green 1
Orange 3 1 2
Yellow 4
Use concat and pass param axis=1:
In [4]:
pd.concat([df1,df2], axis=1)
Out[4]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
join also works:
In [8]:
df1.join(df2)
Out[8]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
As does merge:
In [11]:
df1.merge(df2,left_index=True, right_index=True, how='left')
Out[11]:
Teams Points Area Miles
0 Red 2 2 3
1 Green 1 1 2
2 Orange 3 NaN NaN
3 Yellow 4 NaN NaN
EDIT
In the case where the indices do not align (for example, your first df has index [0,1,2,3] and your second df has index [0,2]), the above operations will naturally align against the first df's index, producing a NaN row at index 1. To fix this, reindex the second df, either by calling reset_index() or by assigning directly, like so: df2.index = [0, 1].
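A sketch of that misaligned case and the reset_index() fix, using sample frames like those in the question (df2 given the non-consecutive index [0, 2]):

```python
import pandas as pd

df1 = pd.DataFrame({'Teams': ['Red', 'Green', 'Orange', 'Yellow'],
                    'Points': [2, 1, 3, 4]})
# df2 has the non-consecutive index [0, 2], so a plain concat would leave a NaN row at index 1
df2 = pd.DataFrame({'Area': [2, 1], 'Miles': [3, 2]}, index=[0, 2])

# Reset df2's index to [0, 1] so it aligns with df1's first two rows
aligned = pd.concat([df1, df2.reset_index(drop=True)], axis=1)
print(aligned)
```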
