Create new rows out of columns with multiple items in Python - python-3.x

I have the code below and need to create a data frame similar to the attached picture. Thanks.
import pandas as pd
Product = [(100, 'Item1, Item2'),
(101, 'Item1, Item3'),
(102, 'Item4')]
labels = ['product', 'info']
ProductA = pd.DataFrame.from_records(Product, columns=labels)
Cust = [('A', 200),
('A', 202),
('B', 202),
('C', 200),
('C', 204),
('B', 202),
('A', 200),
('C', 204)]
labels = ['customer', 'product']
Cust1 = pd.DataFrame.from_records(Cust, columns=labels)

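Use merge with get_dummies:
# DataFrame.sum(level=...) was removed in pandas 2.0; group by the index level instead
dfA.merge(dfB).set_index('customer').tags.str.get_dummies(', ').groupby(level=0, sort=False).sum()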
Out[549]:
chocolate filled glazed sprinkles
customer
A 3 1 0 2
C 1 0 2 1
B 2 2 0 0

If I understand correctly, this is possible with merge, split, melt and concat:
dfB = dfB.merge(dfA, on='product')
dfB = pd.concat([dfB.iloc[:,:-1], dfB.tags.str.split(',', expand=True)], axis=1)
dfB = dfB.melt(id_vars=['customer', 'product']).drop(columns = ['product', 'variable'])
dfB = pd.concat([dfB.customer, pd.get_dummies(dfB['value'])], axis=1)
dfB
Output:
customer filled sprinkles chocolate glazed
0 A 0 0 1 0
1 C 0 0 1 0
2 A 0 0 1 0
3 A 0 0 1 0
4 B 0 0 1 0
5 B 0 0 1 0
6 C 0 0 0 1
7 C 0 0 0 1
8 A 0 1 0 0
9 C 0 1 0 0
10 A 0 1 0 0
11 A 1 0 0 0
12 B 1 0 0 0
13 B 1 0 0 0
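Note that both answers above refer to frames named dfA and dfB and a tags column, which do not appear in the question's code. As a minimal sketch for the task in the title, assuming the goal is one row per item in the info column (the attached picture is not available, so this target is an assumption), Series.str.split followed by DataFrame.explode can be applied to the question's ProductA data. The name long_form is made up for illustration:

```python
import pandas as pd

Product = [(100, 'Item1, Item2'),
           (101, 'Item1, Item3'),
           (102, 'Item4')]
ProductA = pd.DataFrame.from_records(Product, columns=['product', 'info'])

# Turn each comma-separated string into a list of items,
# then emit one row per list element (assumed target shape).
long_form = (ProductA
             .assign(info=ProductA['info'].str.split(', '))
             .explode('info')
             .reset_index(drop=True))
print(long_form)
```

DataFrame.explode requires pandas 0.25 or later.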

Related

pandas groupby; counting overlapping columns

I have a DataFrame that looks like this:
ID A B C D
6234 1 0 1 0
3417 1 0 0 0
9954 0 1 0 0
4369 0 0 0 1
6281 1 0 1 0
And I want to group it so as to make it look like this:
ID
A B C D
1 0 0 0 3
1 0 1 0 2
0 1 0 0 1
0 0 1 0 2
0 0 0 1 1
I have been using the following code, which has not gotten me very far.
import pandas as pd
data = [[6234,1,0,1,0],
[3417,1,0,0,0],
[9954,0,1,0,0],
[4369,0,0,0,1],
[6281,1,0,1,0]]
DF1 = pd.DataFrame(data, columns = ['ID','A','B','C','D'])
DF2 = DF1.groupby(['A','B','C','D']).count()
I would appreciate any insight that anyone might have to offer.
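The expected table seems to mix two kinds of counts: how many rows share an exact (A, B, C, D) combination, and how many rows have a given column set to 1. That reading is an assumption, so the sketch below computes both separately and does not try to reproduce the table exactly. The names pattern_counts and column_counts are made up for illustration:

```python
import pandas as pd

data = [[6234, 1, 0, 1, 0],
        [3417, 1, 0, 0, 0],
        [9954, 0, 1, 0, 0],
        [4369, 0, 0, 0, 1],
        [6281, 1, 0, 1, 0]]
DF1 = pd.DataFrame(data, columns=['ID', 'A', 'B', 'C', 'D'])

# Number of rows for each exact combination of the four flag columns
pattern_counts = DF1.groupby(['A', 'B', 'C', 'D']).size().rename('ID')

# Number of rows in which each individual flag column equals 1
column_counts = DF1[['A', 'B', 'C', 'D']].sum()

print(pattern_counts)
print(column_counts)
```

Here size() counts the rows in each group directly, while count() counts the non-null values of each remaining column per group.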

Add extra columns with default values from a list in a dataframe

I have a dataframe like
df = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
Now I want the data frame to have additional columns from a list=['a','b','c'] with a default value of 0.
So the output will be:
Name Age a b c
Tom 20 0 0 0
nick 21 0 0 0
krish 19 0 0 0
jack 18 0 0 0
Don't name a variable list, because that shadows the Python builtin.
For the new columns, you can create a dictionary from the list and pass it to DataFrame.assign:
d = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
df = pd.DataFrame(d)
L = ['a','b','c']
df1 = df.assign(**dict.fromkeys(L, 0))
Or create a new DataFrame and use DataFrame.join:
df1 = df.join(pd.DataFrame(0, columns=L, index=df.index))
print (df1)
Name Age a b c
0 Tom 20 0 0 0
1 nick 21 0 0 0
2 krish 19 0 0 0
3 jack 18 0 0 0
>>> df.join(df.reindex(columns=list('abc'), fill_value=0))
Name Age a b c
0 Tom 20 0 0 0
1 nick 21 0 0 0
2 krish 19 0 0 0
3 jack 18 0 0 0
You can also use reindex with fill_value=0 to create a new DataFrame of zeros, and then combine the columns with join.

I have a DataFrame's columns and data in lists; I want to put the relevant data in the relevant column

Suppose you are given a list of all the items you can have and, separately, a list of data rows whose length is not fixed (each row may contain any number of items). You want to create a DataFrame from it and put each item in the right column.
For example:
columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt',]]
# and from this I want to create dummy variables like this
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
If you want indicator columns filled only with 0 and 1, use MultiLabelBinarizer, followed by DataFrame.reindex to order the columns by the list and to add an all-0 column for any value that does not occur in the data:
columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt',]]
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(data),columns=mlb.classes_)
.reindex(columns, axis=1, fill_value=0))
print (df)
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
Or Series.str.get_dummies:
df = pd.Series(data).str.join('|').str.get_dummies().reindex(columns, axis=1, fill_value=0)
print (df)
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
This is one approach using collections.Counter.
Ex:
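import pandas as pd
from collections import Counter
columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt']]
# Each Counter maps an item to its number of occurrences in that row
counts = [Counter(row) for row in data]
# Items missing from a row become NaN, so fill with 0 and cast back to int
df = pd.DataFrame(counts, columns=columns).fillna(0).astype(int)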
print(df)
Output:
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
You can try converting data to a dataframe:
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt',]]
df = pd.DataFrame(data)
df
0 1 2
0 hat tie None
1 shoe tie shirt
2 tie shirt None
Then use:
pd.get_dummies(df.stack()).groupby(level=0).agg('sum')
hat shirt shoe tie
0 1 0 0 1
1 0 1 1 1
2 0 1 0 1
Explanation:
df.stack() returns a MultiIndex Series:
0  0      hat
   1      tie
1  0     shoe
   1      tie
   2    shirt
2  0      tie
   1    shirt
dtype: object
If we get the dummy values of this series we get:
     hat  shirt  shoe  tie
0 0    1      0     0    0
  1    0      0     0    1
1 0    0      0     1    0
  1    0      0     0    1
  2    0      1     0    0
2 0    0      0     0    1
  1    0      1     0    0
Then you just have to group by the first index level and combine the rows with sum (because we know each cell is either one or zero after get_dummies):
df = pd.get_dummies(df.stack()).groupby(level=0).agg('sum')

pandas if else only on specific rows

I have a pandas dataframe as below. I want to apply the following condition:
only for rows where A == 2, update the columns 'C' and 'D' to -99.
I have a function like the one below, which updates the values of C and D to -99.
def func(df):
    for col in df.columns:
        if ("C" in col) or ("D" in col):
            df.loc[:,col] = -99
Now I just want to call that function if A == 2. I tried the code below, but it updates all the rows of C and D to -99:
import pandas as pd
import math
import sys
import re
data=[[0,1,0,0, 0],
[1,2,0,0,0],
[2,0,0,0,0],
[2,4,0,0,0],
[1,8,0,0,0],
[3,2, 0,0,0]]
df = pd.DataFrame(data,columns=['A','B','C', 'D','E'])
df
def func(df):
    for col in df.columns:
        if ("C" in col) or ("D" in col):
            df.loc[:,col] = -99
if (df['A'] == 2).any():
    func(df)
print(df)
My expected output:
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 -99 -99 0
3 2 4 -99 -99 0
4 1 8 0 0 0
5 3 2 0 0 0
You can do that by filtering:
df.loc[df['A'] == 2, ['C', 'D']] = -99
Here the first indexer passed to .loc selects the rows: a boolean mask that keeps only the rows where the value in column 'A' is 2. The second indexer selects the columns by a list of names ('C' and 'D'). We then assign -99 to the selected cells.
For the given sample data, we obtain:
>>> df = pd.DataFrame(data,columns=['A','B','C', 'D','E'])
>>> df
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 0 0 0
3 2 4 0 0 0
4 1 8 0 0 0
5 3 2 0 0 0
>>> df.loc[df['A'] == 2, ['C', 'D']] = -99
>>> df
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 -99 -99 0
3 2 4 -99 -99 0
4 1 8 0 0 0
5 3 2 0 0 0
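The question's code changes every row because (df['A'] == 2).any() collapses the condition to a single True/False for the whole frame, and df.loc[:, col] then selects all rows. If you want to keep the function-based style and the substring matching on column names, a minimal sketch is to pass the boolean mask into the function. The helper set_where and its parameters are hypothetical names introduced here, not part of pandas:

```python
import pandas as pd

data = [[0, 1, 0, 0, 0],
        [1, 2, 0, 0, 0],
        [2, 0, 0, 0, 0],
        [2, 4, 0, 0, 0],
        [1, 8, 0, 0, 0],
        [3, 2, 0, 0, 0]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])


def set_where(frame, mask, substrings, value):
    """Hypothetical helper: assign `value` in the rows where `mask` is True,
    for every column whose name contains one of `substrings`."""
    cols = [c for c in frame.columns if any(s in c for s in substrings)]
    frame.loc[mask, cols] = value  # row mask and column list applied together
    return frame


set_where(df, df['A'] == 2, ['C', 'D'], -99)
print(df)
```

The mask is computed once outside the helper, so the same function can be reused with any row condition.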

Python Pandas Dataframe Melt

I have this as a dataframe:
custid day freq
346782 1 0
346782 0 1
346782 1 2
346783 0 0
346783 0 1
346783 0 2
But for machine learning purposes I want to semi-transpose this into:
346782 1 0 0 1 1 2
346783 0 0 0 1 0 2
That is, each custid should appear only once, with all of its associated features in the same row.
I've tried various things such as:
df1 = pd.melt(newdf, id_vars=['0']).drop('variable', axis=1).sort_values(0)
How can I accomplish this transformation?
I am using stack here; you can also try melt:
s=df.set_index('custid').stack()
s.index=pd.MultiIndex.from_arrays([s.index.get_level_values(level=0),s.groupby(level=0).cumcount()])
s.unstack()
Out[843]:
0 1 2 3 4 5
custid
346782 1 0 0 1 1 2
346783 0 0 0 1 0 2
Use DataFrame.from_dict on the groups from groupby:
In [192]: pd.DataFrame.from_dict(
{k: x[['day', 'freq']].values.flatten() for k, x in df.groupby('custid')},
orient='index')
Out[192]:
0 1 2 3 4 5
346782 1 0 0 1 1 2
346783 0 0 0 1 0 2
You can also try numpy.ravel.
df.groupby("custid").apply(lambda x: x[["day", "freq"]].values.ravel())
custid
346782 [1, 0, 0, 1, 1, 2]
346783 [0, 0, 0, 1, 0, 2]
dtype: object
pd.DataFrame(
df.groupby("custid").apply(lambda x: x[["day", "freq"]].values.ravel()).to_dict()
).T
0 1 2 3 4 5
346782 1 0 0 1 1 2
346783 0 0 0 1 0 2
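If labelled columns are preferred over positional ones (0 to 5), another option is to number the rows within each customer with cumcount and then pivot. This is only a sketch: the sample frame is rebuilt from the question's table, and the helper column n and the day_0, freq_0, ... labels are naming choices made here, not something the question asks for:

```python
import pandas as pd

df = pd.DataFrame({'custid': [346782, 346782, 346782, 346783, 346783, 346783],
                   'day':    [1, 0, 1, 0, 0, 0],
                   'freq':   [0, 1, 2, 0, 1, 2]})

# Number the rows within each customer: 0, 1, 2, ...
numbered = df.assign(n=df.groupby('custid').cumcount())

# One row per custid; the columns become a (value name, n) MultiIndex
wide = numbered.pivot(index='custid', columns='n', values=['day', 'freq'])

# Reorder to (n, value name) so day and freq of the same source row sit side by side
wide = wide.swaplevel(axis=1).sort_index(axis=1)

# Flatten the column labels to day_0, freq_0, day_1, ...
wide.columns = [f'{name}_{n}' for n, name in wide.columns]
print(wide)
```

Customers with fewer rows than others would get NaN in the trailing columns, which a later fillna can handle if the model needs it.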
