I have a DataFrame's columns and data in lists; I want to put the relevant data in the relevant column - python-3.x

Suppose you are given a list of all the items you can have, and separately a list of data whose shape is not fixed: each entry may contain any number of items. You want to create a DataFrame from it and put each value in the right column.
For example:
columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]
# and from this I want to create dummy variables like this
   shirt  shoe  tie  hat
0      0     0    1    1
1      1     1    1    0
2      1     0    1    0

If you want indicator columns filled only with 0 and 1, use MultiLabelBinarizer, together with DataFrame.reindex if you want to change the ordering of the columns by the list; if some value does not exist, it adds a 0-only column:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]

mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(data), columns=mlb.classes_)
        .reindex(columns, axis=1, fill_value=0))
print(df)
   shirt  shoe  tie  hat
0      0     0    1    1
1      1     1    1    0
2      1     0    1    0
Or Series.str.get_dummies:
df = pd.Series(data).str.join('|').str.get_dummies().reindex(columns, axis=1, fill_value=0)
print(df)
   shirt  shoe  tie  hat
0      0     0    1    1
1      1     1    1    0
2      1     0    1    0
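One caveat worth noting (not from the original answer): str.join('|') assumes that no item name contains the '|' character, since get_dummies splits on that same separator again; if that could happen, pick any separator that cannot occur in the data.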

This is one approach using collections.Counter.
Ex:
from collections import Counter
import pandas as pd

columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]

# Each Counter becomes one row; items missing from a row come out
# as NaN, so fill them with 0 and cast back to int.
data = map(Counter, data)
df = pd.DataFrame(data, columns=columns).fillna(0).astype(int)
print(df)
Output:
   shirt  shoe  tie  hat
0      0     0    1    1
1      1     1    1    0
2      1     0    1    0
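A design note, not part of the original answer: because Counter counts occurrences, a duplicated item inside a row yields a value greater than 1, unlike the purely binary approaches above. A minimal sketch, reusing columns from above:
dup = [['tie', 'tie', 'hat']]
print(pd.DataFrame(map(Counter, dup), columns=columns).fillna(0).astype(int))
# the tie column holds 2, not 1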

You can try converting data to a dataframe:
data = [['hat','tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]
df = pd.DataFrame(data)
df
      0      1      2
0   hat    tie   None
1  shoe    tie  shirt
2   tie  shirt   None
Then use:
pd.get_dummies(df.stack()).groupby(level=0).agg('sum')
   hat  shirt  shoe  tie
0    1      0     0    1
1    0      1     1    1
2    0      1     0    1
Explanation:
df.stack() returns a Series with a MultiIndex:
0  0      hat
   1      tie
1  0     shoe
   1      tie
   2    shirt
2  0      tie
   1    shirt
dtype: object
If we get the dummy values of this series we get:
     hat  shirt  shoe  tie
0 0    1      0     0    0
  1    0      0     0    1
1 0    0      0     1    0
  1    0      0     0    1
  2    0      1     0    0
2 0    0      0     0    1
  1    0      1     0    0
Then you just have to group by the first index level and merge the rows using sum (we know each cell can only be one or zero after get_dummies):
df = pd.get_dummies(df.stack()).groupby(level=0).agg('sum')
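Putting the whole chain together as a runnable sketch (the final reindex is an extra step not shown above, added here to match the asker's requested column order):
import pandas as pd

columns = ['shirt', 'shoe', 'tie', 'hat']
data = [['hat', 'tie'],
        ['shoe', 'tie', 'shirt'],
        ['tie', 'shirt']]

# stack -> get_dummies -> collapse back to one row per original list
out = pd.get_dummies(pd.DataFrame(data).stack()).groupby(level=0).agg('sum')
out = out.reindex(columns, axis=1, fill_value=0)
print(out)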

Related

pandas groupby; counting overlapping columns

I have a DataFrame that looks like this:
  ID   A  B  C  D
6234   1  0  1  0
3417   1  0  0  0
9954   0  1  0  0
4369   0  0  0  1
6281   1  0  1  0
And I want to group it so as to make it look like this:
         ID
A B C D
1 0 0 0   3
1 0 1 0   2
0 1 0 0   1
0 0 1 0   2
0 0 0 1   1
I have been using the following code, which has not gotten me very far.
import pandas as pd

data = [[6234,1,0,1,0],
        [3417,1,0,0,0],
        [9954,0,1,0,0],
        [4369,0,0,0,1],
        [6281,1,0,1,0]]
DF1 = pd.DataFrame(data, columns=['ID','A','B','C','D'])
DF2 = DF1.groupby(['A','B','C','D']).count()
I would appreciate any insight that anyone might have to offer.
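This excerpt carries no accepted answer; one possible sketch, assuming the goal is simply counting rows per unique A/B/C/D combination and reusing DF1 from the snippet above (the counts in the desired output look illustrative rather than derived from this exact sample):
# size() counts rows per combination; reset_index turns the group keys back
# into columns, with the count labelled 'ID' as in the desired output
DF2 = DF1.groupby(['A','B','C','D'], sort=False).size().reset_index(name='ID')
print(DF2)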

How to identify a sequence and index number before a particular sequence occurs for the first time

I have a dataframe in pandas, an example of which is provided below:
Person  appear_1  appear_2  appear_3  appear_4  appear_5  appear_6
     A         1         0         0         1         0         1
     B         1         1         0         0         1         0
     C         1         0         1         1         0         0
     D         1         1         1         1         0         0
As you can see, 1s and 0s occur randomly in the different columns. It would be helpful if anyone could suggest Python code that counts the number of 1s that occur before the first occurrence of the sequence 1, 0, 0 in order. For example, for member A the first double-zero event occurs at appear_2 and appear_3, so the duration will be 1. Similarly, for member B the first double-zero event occurs at appear_3 and appear_4, so a total of two 1s occur before it. The 1 at the start of the 1, 0, 0 sequence is included in the count, because that 1 indicates that a person started the process, and the 0, 0 indicates his/her absence for two consecutive appearances after initiating it. The resulting table should have a new column 'duration', something like this:
Person  appear_1  appear_2  appear_3  appear_4  appear_5  appear_6  duration
     A         1         0         0         1         0         1         1
     B         1         1         0         0         1         0         2
     C         1         0         1         1         0         0         3
     D         1         1         1         1         0         0         4
Thank you in advance.
A little logic here: first we use a rolling sum to find where two neighbouring values sum to 0, then we do a cumprod along the row; once it hits a 0, the product stays 0 from there on, so masking on it and summing the remaining values per row gives the result.
s = df.iloc[:, 1:]
# rolling sum over pairs of neighbours: 0 where two consecutive values are
# both 0 (or where the row starts with 0); cumprod stays 0 from there on
s1 = s.rolling(2, axis=1, min_periods=1).sum().cumprod(axis=1)
# mask everything from the first double zero onward, then count the 1s per row
s.mask(s1 == 0).sum(1)
Out[37]:
0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64
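To attach the result back to the frame, a possible follow-up line reusing s and s1 from above (a sketch, not part of the original answer):
df['duration'] = s.mask(s1 == 0).sum(1).astype(int)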
My logic checks the current position against the next position: if both are 0, the mask turns True at that location. After that, do a cumsum on axis=1; locations in front of the first True stay 0 under cumsum. Finally, compare the cumsum to 0 to keep only the positions that appear before the double zero, and sum. To use this logic, I need to handle the case where the double zero comes first in the row, as in 'D', 0, 0, 1, 1, 0, 0. Your sample doesn't have this case; however, I expect the real data would have it.
cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df[cols]
# True from the first 1 onward in each row (ignores leading zeros)
m = df1[df1.eq(1)].ffill(1).notna()
# inside that region, mark the zeros; positions before the first 1 are
# back-filled with 1, so they never count as zeros
df2 = df1[m].bfill(1).eq(0)
# True where a 0 is immediately followed by another 0
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)
# keep only positions before the first double zero and sum the 1s
df['duration'] = df1[m2.cumsum(1) == 0].sum(1)
Out[100]:
  Person  appear_1  appear_2  appear_3  appear_4  appear_5  appear_6  duration
0      A         1         0         0         1         0         1       1.0
1      B         1         1         0         0         1         0       2.0
2      C         1         0         1         1         0         0       3.0
3      D         1         1         1         1         0         0       4.0
Change your sample to include the special case where the first elements are 0.
Update: add case E where all appear_x are 1.
Sample (df_n):
  Person  appear_1  appear_2  appear_3  appear_4  appear_5  appear_6
0      A         1         0         0         1         0         1
1      B         1         1         0         0         1         0
2      C         1         0         1         1         0         0
3      D         0         0         1         1         0         0
4      E         1         1         1         1         1         1
cols = ['appear_1', 'appear_2', 'appear_3', 'appear_4', 'appear_5', 'appear_6']
df1 = df_n[cols]
m = df1[df1.eq(1)].ffill(1).notna()
df2 = df1[m].bfill(1).eq(0)
m2 = df2 & df2.shift(-1, axis=1, fill_value=True)
df_n['duration'] = df1[m2.cumsum(1) == 0].sum(1)
Out[503]:
  Person  appear_1  appear_2  appear_3  appear_4  appear_5  appear_6  duration
0      A         1         0         0         1         0         1       1.0
1      B         1         1         0         0         1         0       2.0
2      C         1         0         1         1         0         0       3.0
3      D         0         0         1         1         0         0       2.0
4      E         1         1         1         1         1         1       6.0

Pandas DataFrame: create a matrix-like with 0 and 1

I have to create a matrix-like structure with 0 and 1. How can I create something like that?
This is my DataFrame:
I want to check the intersection where df['luogo'] is 'sala' and the column df['sala'], and replace the value there with 1.
This is my try:
for head in dataframe.columns:
    for i in dataframe['luogo']:
        if i == head:
            dataframe[head] = 1   # note: this assigns the whole column, not one cell
        else:
            dataframe[head] = 0
Sorry for the Italian dataframe.
You are probably looking for pandas.get_dummies(..) [pandas-doc]. For a given dataframe df:
>>> df
    luogo
0    sala
1  scuola
2  teatro
3    sala
We get:
>>> pd.get_dummies(df['luogo'])
   sala  scuola  teatro
0     1       0       0
1     0       1       0
2     0       0       1
3     1       0       0
You thus can join this with your original dataframe with:
>>> df.join(pd.get_dummies(df['luogo']))
    luogo  sala  scuola  teatro
0    sala     1       0       0
1  scuola     0       1       0
2  teatro     0       0       1
3    sala     1       0       0
This thus constructs a "one hot encoding" [wiki] of the values in your original dataframe.
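If you want the dummy columns to replace the original text column instead of sitting next to it, a possible one-liner (a sketch, not from the original answer; passing empty prefix and prefix_sep keeps the bare category names):
>>> pd.get_dummies(df, columns=['luogo'], prefix='', prefix_sep='')
   sala  scuola  teatro
0     1       0       0
1     0       1       0
2     0       0       1
3     1       0       0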

pandas assign value in multiple columns based on value in one

I have a dataset like this,
sample = {'Theme': ['never give a ten','interaction speed','no feedback,premium'],
          'cat1': [0,0,0],
          'cat2': [0,0,0],
          'cat3': [0,0,0],
          'cat4': [0,0,0]
          }
pd.DataFrame(sample, columns=['Theme','cat1','cat2','cat3','cat4'])
                 Theme  cat1  cat2  cat3  cat4
0     never give a ten     0     0     0     0
1    interaction speed     0     0     0     0
2  no feedback,premium     0     0     0     0
Now, I need to replace the values in the cat columns based on the value in Theme: if the Theme column has 'never give a ten', set cat1 to 1; similarly, if it has 'interaction speed', set cat2 to 1; if it has 'no feedback', set cat3 to 1; and for 'premium', set cat4 to 1.
In this sample I have provided 4 categories; in total I have 21. I could write an if word in string check 21 times for the 21 categories, but I am looking for an efficient way to write this in a function that loops over every row, applies the logic, and updates the corresponding columns. Can anyone help, please?
Thanks in advance.
It is possible to set the column names by categories with Series.str.get_dummies; note that the column names come out sorted:
df1 = df['Theme'].str.get_dummies(',')
print(df1)
   interaction speed  never give a ten  no feedback  premium
0                  0                 1            0        0
1                  1                 0            0        0
2                  0                 0            1        1
If you need the first column in the output, add DataFrame.join:
df11 = df[['Theme']].join(df['Theme'].str.get_dummies(','))
print(df11)
                 Theme  interaction speed  never give a ten  no feedback  premium
0     never give a ten                  0                 1            0        0
1    interaction speed                  1                 0            0        0
2  no feedback,premium                  0                 0            1        1
If the order of the columns is important, add DataFrame.reindex:
# remove possible duplicates while keeping the original ordering
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df['Theme'].str.get_dummies(',').reindex(cols, axis=1)
print(df2)
   never give a ten  interaction speed  no feedback  premium
0                 1                  0            0        0
1                 0                  1            0        0
2                 0                  0            1        1
cols = dict.fromkeys([y for x in df['Theme'] for y in x.split(',')]).keys()
df2 = df[['Theme']].join(df['Theme'].str.get_dummies(',').reindex(cols, axis=1))
print(df2)
                 Theme  never give a ten  interaction speed  no feedback  premium
0     never give a ten                 1                  0            0        0
1    interaction speed                 0                  1            0        0
2  no feedback,premium                 0                  0            1        1
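The answers above name the indicator columns after the category strings themselves; if you specifically need the asker's existing cat1..cat4 columns filled in, a possible mapping-based sketch (the phrase-to-column dict is an assumption for illustration, to be extended to all 21 categories):
df = pd.DataFrame(sample, columns=['Theme','cat1','cat2','cat3','cat4'])
# hypothetical mapping from phrase to its target column
mapping = {'never give a ten': 'cat1', 'interaction speed': 'cat2',
           'no feedback': 'cat3', 'premium': 'cat4'}
for phrase, col in mapping.items():
    # regex=False matches the phrase literally
    df.loc[df['Theme'].str.contains(phrase, regex=False), col] = 1
print(df)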

Comparing two differently sized pandas DataFrames to find the row indices with equal values

I need some help with comparing two pandas DataFrames.
I have two dataframes
The first dataframe is
df1 =
   a  b  c  d
0  1  1  1  1
1  0  1  0  1
2  0  0  0  1
3  1  1  1  1
4  1  0  1  0
5  1  1  1  0
6  0  0  1  0
7  0  1  0  1
and the second dataframe is
df2 =
   a  b  c  d
0  1  1  1  1
1  1  0  1  0
2  0  0  1  0
I want to find the row indices of dataframe 1 (df1) where the entire row is the same as a row in dataframe 2 (df2). My expected result would be:
0
3
4
6
The above indices do not need to be in order; all I want is the indices of dataframe 1 (df1).
Is there a way without using for loop?
Thanks
Tommy
You can use merge:
df1.merge(df2, indicator=True, how='left').loc[lambda x: x['_merge']=='both'].index
Out[459]: Int64Index([0, 3, 4, 6], dtype='int64')
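A self-contained version of the same idea (a sketch; how='left' preserves df1's row order, so with a default RangeIndex the result's index positions line up with df1's rows):
import pandas as pd

df1 = pd.DataFrame([[1,1,1,1],[0,1,0,1],[0,0,0,1],[1,1,1,1],
                    [1,0,1,0],[1,1,1,0],[0,0,1,0],[0,1,0,1]],
                   columns=list('abcd'))
df2 = pd.DataFrame([[1,1,1,1],[1,0,1,0],[0,0,1,0]], columns=list('abcd'))

# merge on all shared columns; '_merge' marks rows found in both frames
idx = df1.merge(df2, indicator=True, how='left').loc[lambda x: x['_merge'] == 'both'].index
print(idx)  # Int64Index([0, 3, 4, 6], dtype='int64')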
