Add extra columns with default values from a list in a dataframe - python-3.x

I have a dataframe like
df = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
Now I want the dataframe to have additional columns taken from a list=['a','b','c'], with 0 as the default value.
So the output will be
Name Age a b c
Tom 20 0 0 0
nick 21 0 0 0
krish 19 0 0 0
jack 18 0 0 0

Don't use list as a variable name, because it shadows the Python builtin.
For the new columns, you can create a dictionary from the list and pass it to DataFrame.assign:
d = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
df = pd.DataFrame(d)
L = ['a','b','c']
df1 = df.assign(**dict.fromkeys(L, 0))
Or create a new DataFrame and use DataFrame.join:
df1 = df.join(pd.DataFrame(0, columns=L, index=df.index))
print (df1)
Name Age a b c
0 Tom 20 0 0 0
1 nick 21 0 0 0
2 krish 19 0 0 0
3 jack 18 0 0 0
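As a self-contained check of the two approaches above, here is a minimal sketch. The name new_cols is just an illustrative replacement for L, and it assumes the new column names are strings (required for the ** unpacking into assign).

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'nick', 'krish', 'jack'],
                   'Age': [20, 21, 19, 18]})
new_cols = ['a', 'b', 'c']  # illustrative name, avoids shadowing the builtin list

# Approach 1: dict.fromkeys builds {'a': 0, 'b': 0, 'c': 0}, unpacked as keyword arguments
via_assign = df.assign(**dict.fromkeys(new_cols, 0))

# Approach 2: build an all-zero frame on the same index and join it column-wise
via_join = df.join(pd.DataFrame(0, columns=new_cols, index=df.index))

print(via_assign)
print(via_join)
```

Both calls return a new DataFrame and leave df untouched, so the result has to be assigned back if you want to keep it.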

>>> df.join(df.reindex(columns=list('abc'), fill_value=0))
Name Age a b c
0 Tom 20 0 0 0
1 nick 21 0 0 0
2 krish 19 0 0 0
3 jack 18 0 0 0
You can also use reindex to create a new DataFrame filled with zeros, and then combine the columns using join.
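A possible variation on the reindex idea, sketched under the assumption that none of the new names already exist in df: reindex over the old and new columns together, so no join is needed. fill_value only applies to columns that reindex has to create.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'nick', 'krish', 'jack'],
                   'Age': [20, 21, 19, 18]})
new_cols = ['a', 'b', 'c']  # illustrative name for the list of extra columns

# Existing columns keep their data; columns missing from df are created with fill_value
result = df.reindex(columns=[*df.columns, *new_cols], fill_value=0)
print(result)
```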

Related

Using Pandas to assign specific values

I have the following dataframe:
data = {'id': [1, 2, 3, 4, 5, 6, 7, 8],
'stat': ['ordered', 'unconfirmed', 'ordered', 'unknwon', 'ordered', 'unconfirmed', 'ordered', 'back'],
'date': ['2021', '2022', '2023', '2024', '2025','2026','2027', '1990']
}
df = pd.DataFrame(data)
df
I am trying to get the following data frame:
Unfortunately I have not been successful so far. I used the following commands (for loops), for stat == 'ordered' only:
y0 = np.zeros((len(df), 8), dtype=int)
y1 = [1990]
if stat=='ordered':
    for i in df['id']:
        for j in y1:
            if df.loc[i].at['date'] in y1:
                y0[i][y1.index(j)] = 1
            else:
                y0[i][y1.index(j)] = 0
But unfortunately it did not return the expected result, and besides that it takes a very long time to run. I tried to use groupby, since it is faster than for loops, but I could not figure out how to use it properly. Any idea would be very much appreciated.
IIUC:
df.join(
pd.get_dummies(df.date).cumsum(axis=1).mul(
[1, 2, 1, 3, 1, 2, 1, 0], axis=0
).astype(int)
)
id stat date 1990 2021 2022 2023 2024 2025 2026 2027
0 1 ordered 2021 0 1 1 1 1 1 1 1
1 2 unconfirmed 2022 0 0 2 2 2 2 2 2
2 3 ordered 2023 0 0 0 1 1 1 1 1
3 4 unknwon 2024 0 0 0 0 3 3 3 3
4 5 ordered 2025 0 0 0 0 0 1 1 1
5 6 unconfirmed 2026 0 0 0 0 0 0 2 2
6 7 ordered 2027 0 0 0 0 0 0 0 1
7 8 back 1990 0 0 0 0 0 0 0 0
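The multiplier list in the answer above is typed by hand. A minimal sketch of deriving it from the stat column instead follows. The codes mapping is an assumption inferred from that list (ordered=1, unconfirmed=2, unknwon=3, back=0), not something the question states.

```python
import pandas as pd

data = {'id': [1, 2, 3, 4, 5, 6, 7, 8],
        'stat': ['ordered', 'unconfirmed', 'ordered', 'unknwon',
                 'ordered', 'unconfirmed', 'ordered', 'back'],
        'date': ['2021', '2022', '2023', '2024', '2025', '2026', '2027', '1990']}
df = pd.DataFrame(data)

# Assumed mapping, read off the hand-written multipliers; spelling follows the data
codes = {'ordered': 1, 'unconfirmed': 2, 'unknwon': 3, 'back': 0}
weights = df['stat'].map(codes)

# One indicator column per year; cumsum along the row keeps the flag on for later years
flags = pd.get_dummies(df['date'], dtype=int).cumsum(axis=1)

# Scale each row by its status code
result = df.join(flags.mul(weights, axis=0))
print(result)
```

Passing dtype=int to get_dummies keeps the intermediate frame numeric regardless of the pandas version's default dummy dtype.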

How to split string with the values in their specific columns indexed on their label?

I have the following data
Index Data
0 100CO
1 50CO-50PET
2 98CV-2EL
3 50CV-50CO
. .
. .
. .
I have to split the data into different columns, each with its own header and values. The result should be as below:
Index Data CO PET CV EL
0 100CO 100 0 0 0
1 50CO-50PET 50 50 0 0
2 98CV-2EL 0 0 98 2
3 50CV-50CO 50 0 50 0
. .
. .
. .
The data is not limited to CO/PET/CV/EL; I will need as many columns as there are labels, each displaying its corresponding value.
The .str.split('-', expand=True) function only splits the data at the delimiter: it keeps all first values in the same column and does not name the columns.
Is there a way to implement this in python?
You could do:
df.Data.str.split('-').explode().str.split(r'(?<=\d)(?=\D)',expand = True). \
reset_index().pivot(index='index', columns=1, values=0).fillna(0).reset_index()
1 Index CO CV EL PET
0 0 100 0 0 0
1 1 50 0 0 50
2 2 0 98 2 0
3 3 50 50 0 0
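The regular expression in the one-liner above has no visible characters to match: it splits at the empty position between a digit and a non-digit. A small sketch with the standard re module shows what that does to a single token (splitting on empty matches needs Python 3.7 or later).

```python
import re

# Zero-width split point: preceded by a digit, followed by a non-digit
pattern = re.compile(r'(?<=\d)(?=\D)')

for token in '50CO-50PET'.split('-'):
    number, label = pattern.split(token)
    print(label, int(number))
```

A token with no digit-to-letter boundary, such as 'AAA' or '20', comes back as a single-item list, so the two-name unpacking above would fail on it.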
The idea is to first split the values by -, then extract the numeric and non-numeric parts into tuples, append them to a list and convert it to a dictionary. The dictionaries are passed in a list comprehension to the DataFrame constructor, missing values are replaced with 0 and the result is converted to numeric:
import re
def f(x):
    # Turn one value into a {label: number} dict, e.g. '50CO-50PET' -> {'CO': '50', 'PET': '50'}
    L = []
    for val in x.split('-'):
        k, v = re.findall(r'(\d+)(\D+)', val)[0]
        L.append((v, k))
    return dict(L)
df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print (df)
Data CO PET CV EL
0 100CO 100 0 0 0
1 50CO-50PET 50 50 0 0
2 98CV-2EL 0 0 98 2
3 50CV-50CO 50 0 50 0
If the data contains values without a number, or values that are a number only, the solution should be made more general, like:
print (df)
Data
0 100CO
1 50CO-50PET
2 98CV-2EL
3 50CV-50CO
4 AAA
5 20
def f(x):
    L = []
    for val in x.split('-'):
        extracted = re.findall(r'(\d+)(\D+)', val)
        if len(extracted) > 0:
            # Normal case: a number followed by a label
            k, v = extracted[0]
            L.append((v, k))
        else:
            if val.isdigit():
                # Number without a label goes to a dedicated column
                L.append(('No match digit', val))
            else:
                # Label without a number gets its own column with value 0
                L.append((val, 0))
    return dict(L)
df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print (df)
Data CO PET CV EL AAA No match digit
0 100CO 100 0 0 0 0 0
1 50CO-50PET 50 50 0 0 0 0
2 98CV-2EL 0 0 98 2 0 0
3 50CV-50CO 50 0 50 0 0 0
4 AAA 0 0 0 0 0 0
5 20 0 0 0 0 0 20
Try this:
import pandas as pd
import re
df = pd.DataFrame({'Data':['100CO', '50CO-50PET', '98CV-2EL', '50CV-50CO']})
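# Build one {label: number} dict per row, e.g. '50CO-50PET' -> {'CO': 50, 'PET': 50}
split_df = pd.DataFrame(df.Data.apply(lambda x: {re.findall('[A-Z]+', el)[0]: int(re.findall('[0-9]+', el)[0])
                                                 for el in x.split('-')}).tolist())
# Labels missing from a row become NaN; fill them with 0 and restore the integer dtype
split_df = split_df.fillna(0).astype(int)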
df = pd.concat([df, split_df], axis = 1)
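The answers above all pair each number with its label and then spread the labels into columns. The same idea can be sketched with Series.str.extractall and unstack. The group names value and label are my own choice, and the sketch assumes that a label never repeats within one row (unstack raises on duplicates) and that every row has at least one number-label pair.

```python
import pandas as pd

df = pd.DataFrame({'Data': ['100CO', '50CO-50PET', '98CV-2EL', '50CV-50CO']})

# One output row per number+label pair, indexed by (original row, match number)
parts = df['Data'].str.extractall(r'(?P<value>\d+)(?P<label>[A-Za-z]+)')
parts['value'] = parts['value'].astype(int)

# Drop the match counter, move the label into the index, then spread labels into columns
wide = (parts.reset_index(level='match', drop=True)
             .set_index('label', append=True)['value']
             .unstack(fill_value=0))

result = df.join(wide)
print(result)
```

The new columns come out in sorted label order (CO, CV, EL, PET) rather than in order of first appearance.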

I have a DataFrame's columns and data in lists and want to put the relevant data in the relevant column

Suppose you are given a list of all the items you can have and, separately, a list of data whose shape is not fixed: each entry may contain any number of items. You want to create a dataframe from it and put each item in the right column.
For example:
columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt',]]
# and from this I want to create dummy variables like this
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
If you want indicator columns filled only with 0 and 1, use MultiLabelBinarizer. Add DataFrame.reindex to order the columns by the list; any value from the list that does not occur in the data is added as an all-0 column:
columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt',]]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(data),columns=mlb.classes_)
.reindex(columns, axis=1, fill_value=0))
print (df)
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
Or Series.str.get_dummies:
df = pd.Series(data).str.join('|').str.get_dummies().reindex(columns, axis=1, fill_value=0)
print (df)
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
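For small inputs, the same indicator table can also be sketched without scikit-learn or string joining, using a plain membership test per cell. This is only an illustration of what the two approaches above compute, not a faster replacement.

```python
import pandas as pd

columns = ['shirt', 'shoe', 'tie', 'hat']
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]

# For every row, write 1 if the column's item is present in that row, else 0
rows = [[int(col in items) for col in columns] for items in data]
df = pd.DataFrame(rows, columns=columns)
print(df)
```

Items that appear in data but not in columns are silently ignored here, which matches the effect of the reindex call in the answers above.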
This is one approach using collections.Counter.
Ex:
from collections import Counter
columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt']]
data = map(Counter, data)
#df = pd.DataFrame(data, columns=columns)
df = pd.DataFrame(data, columns=columns).fillna(0).astype(int)
print(df)
Output:
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
You can try converting data to a dataframe:
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt',]]
df = pd.DataFrame(data)
df
0 1 2
0 hat tie None
1 shoe tie shirt
2 tie shirt None
Then use:
pd.get_dummies(df.stack()).groupby(level=0).agg('sum')
hat shirt shoe tie
0 1 0 0 1
1 0 1 1 1
2 0 1 0 1
Explanation:
df.stack() returns a MultiIndex Series:
0 0 hat
1 tie
1 0 shoe
1 tie
2 shirt
2 0 tie
1 shirt
dtype: object
If we get the dummy values of this series we get:
hat shirt shoe tie
0 0 1 0 0 0
1 0 0 0 1
1 0 0 0 1 0
1 0 0 0 1
2 0 1 0 0
2 0 0 0 0 1
1 0 1 0 0
Then you just have to groupby the index and merge them using sum (because we know that there will only be one or zero after get_dummies):
df = pd.get_dummies(df.stack()).groupby(level=0).agg('sum')
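The result above has its columns in alphabetical order, not in the order of the columns list from the question. A minimal end-to-end sketch that adds a reindex for that follows; dtype=int is passed so the dummies are 0/1 integers whatever the pandas version's default dummy dtype is.

```python
import pandas as pd

columns = ['shirt', 'shoe', 'tie', 'hat']
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]

df = pd.DataFrame(data)

# Stack to one item per row, one-hot encode, then add the pieces back up per original row
dummies = pd.get_dummies(df.stack(), dtype=int).groupby(level=0).sum()

# Put the columns in the requested order; items that never occur become all-0 columns
result = dummies.reindex(columns=columns, fill_value=0)
print(result)
```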

Pandas DataFrame: create a matrix-like with 0 and 1

I have to create a matrix-like structure of 0s and 1s. How can I create something like that?
This is my DataFrame:
I want to find each intersection where the value of df['luogo'] matches the column name (for example, df['luogo'] is 'sala' and the column is df['sala']) and set it to 1.
This is my try:
for head in dataframe.columns:
    for i in dataframe['luogo']:
        if i == head:
            dataframe[head] = 1
        else:
            dataframe[head] = 0
Sorry for the Italian dataframe.
You are probably looking for pandas.get_dummies(..). For a given dataframe df:
>>> df
luogo
0 sala
1 scuola
2 teatro
3 sala
We get:
>>> pd.get_dummies(df['luogo'])
sala scuola teatro
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
You can then join this with your original dataframe:
>>> df.join(pd.get_dummies(df['luogo']))
luogo sala scuola teatro
0 sala 1 0 0
1 scuola 0 1 0
2 teatro 0 0 1
3 sala 1 0 0
This constructs a "one-hot encoding" of the values in your original dataframe.
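One caveat worth sketching: since pandas 2.0, get_dummies returns boolean columns by default, so the same calls print True/False instead of 1/0. Passing dtype=int reproduces the 0/1 matrix shown above. The name encoded is just illustrative.

```python
import pandas as pd

df = pd.DataFrame({'luogo': ['sala', 'scuola', 'teatro', 'sala']})

# dtype=int forces 0/1 integers instead of the boolean default of newer pandas versions
encoded = df.join(pd.get_dummies(df['luogo'], dtype=int))
print(encoded)
```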

Create new rows out of columns with multiple items in Python

I have this code and I need to create a data frame similar to the picture attached. Thanks
import pandas as pd
Product = [(100, 'Item1, Item2'),
(101, 'Item1, Item3'),
(102, 'Item4')]
labels = ['product', 'info']
ProductA = pd.DataFrame.from_records(Product, columns=labels)
Cust = [('A', 200),
('A', 202),
('B', 202),
('C', 200),
('C', 204),
('B', 202),
('A', 200),
('C', 204)]
labels = ['customer', 'product']
Cust1 = pd.DataFrame.from_records(Cust, columns=labels)
Use merge with get_dummies:
dfA.merge(dfB).set_index('customer').tags.str.get_dummies(', ').groupby(level=0, sort=False).sum()
Out[549]:
chocolate filled glazed sprinkles
customer
A 3 1 0 2
C 1 0 2 1
B 2 2 0 0
IIUC, this is possible with merge, split, melt and concat:
dfB = dfB.merge(dfA, on='product')
dfB = pd.concat([dfB.iloc[:,:-1], dfB.tags.str.split(',', expand=True)], axis=1)
dfB = dfB.melt(id_vars=['customer', 'product']).drop(columns = ['product', 'variable'])
dfB = pd.concat([dfB.customer, pd.get_dummies(dfB['value'])], axis=1)
dfB
Output:
customer filled sprinkles chocolate glazed
0 A 0 0 1 0
1 C 0 0 1 0
2 A 0 0 1 0
3 A 0 0 1 0
4 B 0 0 1 0
5 B 0 0 1 0
6 C 0 0 0 1
7 C 0 0 0 1
8 A 0 1 0 0
9 C 0 1 0 0
10 A 0 1 0 0
11 A 1 0 0 0
12 B 1 0 0 0
13 B 1 0 0 0
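For the literal task in the title, turning a column that holds several items into one row per item, a minimal sketch using DataFrame.explode on the question's ProductA frame follows. It assumes the items are always separated by a comma and a space, as in the sample data, and the name long_form is illustrative.

```python
import pandas as pd

Product = [(100, 'Item1, Item2'),
           (101, 'Item1, Item3'),
           (102, 'Item4')]
ProductA = pd.DataFrame.from_records(Product, columns=['product', 'info'])

# Split the string into a list per row, then give every list element its own row
long_form = (ProductA.assign(info=ProductA['info'].str.split(', '))
                     .explode('info')
                     .reset_index(drop=True))
print(long_form)
```

From this long form, the per-item dummies in the answers above can be built with get_dummies on the info column.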
