Pandas DataFrame: create a matrix-like with 0 and 1 - python-3.x

i have to create a matrix-like with 0 and 1. How can i create something like that?
This is my DataFrame:
I want to check the intersection where df['luogo'] is 'sala' and df['sala'] and replace it with 1.
This is my try:
for head in dataframe.columns:
for i in dataframe['luogo']:
if i == head:
dataframe[head] = 1
else:
dataframe[head] = 0
Sorry for the italian dataframe.

You are probably looking for pandas.get_dummies(..) [pandas-doc]. For a given dataframe df:
>>> df
luogo
0 sala
1 scuola
2 teatro
3 sala
We get:
>>> pd.get_dummies(df['luogo'])
sala scuola teatro
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
You thus can join this with your original dataframe with:
>>> df.join(pd.get_dummies(df['luogo']))
luogo sala scuola teatro
0 sala 1 0 0
1 scuola 0 1 0
2 teatro 0 0 1
3 sala 1 0 0
This thus constructs a "one hot encoding" [wiki] of the values in your original dataframe.

Related

How to split string with the values in their specific columns indexed on their label?

I have the following data
Index Data
0 100CO
1 50CO-50PET
2 98CV-2EL
3 50CV-50CO
. .
. .
. .
I have to create split the data format into different columns each with their own header and their values, the result should be as below:
Index Data CO PET CV EL
0 100CO 100 0 0 0
1 50CO-50PET 50 50 0 0
2 98CV-2EL 0 0 98 2
3 50CV-50CO 50 0 50 0
. .
. .
. .
The data is not limited to CO/PET/CV/EL, will need as many columns needed each displaying its corresponding value.
The .str.split('-', expand=True) function will only delimit the data and keep all first values in same column and does not rename each column.
Is there a way to implement this in python?
You could do:
df.Data.str.split('-').explode().str.split(r'(?<=\d)(?=\D)',expand = True). \
reset_index().pivot('index',1,0).fillna(0).reset_index()
1 Index CO CV EL PET
0 0 100 0 0 0
1 1 50 0 0 50
2 2 0 98 2 0
3 3 50 50 0 0
Idea is first split values by -, then extract numbers and no numbers values to tuples, append to list and convert to dictionaries. It is passed in list comprehension to DataFrame cosntructor, replaced misisng values and converted to numeric:
import re
def f(x):
L = []
for val in x.split('-'):
k, v = re.findall('(\d+)(\D+)', val)[0]
L.append((v, k))
return dict(L)
df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print (df)
Data CO PET CV EL
0 100CO 100 0 0 0
1 50CO-50PET 50 50 0 0
2 98CV-2EL 0 0 98 2
3 50CV-50CO 50 0 50 0
If in data exist some values without number or number only solution should be changed for more general like:
print (df)
Data
0 100CO
1 50CO-50PET
2 98CV-2EL
3 50CV-50CO
4 AAA
5 20
def f(x):
L = []
for val in x.split('-'):
extracted = re.findall('(\d+)(\D+)', val)
if len(extracted) > 0:
k, v = extracted[0]
L.append((v, k))
else:
if val.isdigit():
L.append(('No match digit', val))
else:
L.append((val, 0))
return dict(L)
df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print (df)
Data CO PET CV EL AAA No match digit
0 100CO 100 0 0 0 0 0
1 50CO-50PET 50 50 0 0 0 0
2 98CV-2EL 0 0 98 2 0 0
3 50CV-50CO 50 0 50 0 0 0
4 AAA 0 0 0 0 0 0
5 20 0 0 0 0 0 20
Try this:
import pandas as pd
import re
df = pd.DataFrame({'Data':['100CO', '50CO-50PET', '98CV-2EL', '50CV-50CO']})
split_df = pd.DataFrame(df.Data.apply(lambda x: {re.findall('[A-Z]+', el)[0] : re.findall('[0-9]+', el)[0] \
for el in x.split('-')}).tolist())
split_df = split_df.fillna(0)
df = pd.concat([df, split_df], axis = 1)

pandas groupby; counting overlapping colums

I have a DataFrame that looks like this:
ID A B C D
6234 1 0 1 0
3417 1 0 0 0
9954 0 1 0 0
4369 0 0 0 1
6281 1 0 1 0
And I want to group it so as to make it look like this:
ID
A B C D
1 0 0 0 3
1 0 1 0 2
0 1 0 0 1
0 0 1 0 2
0 0 0 1 1
I have been using the following code, which has not gotten me very far.
import pandas as pd
data = [[6234,1,0,1,0],
[3417,1,0,0,0],
[9954,0,1,0,0],
[4369,0,0,0,1],
[6281,1,0,1,0]]
DF1 = pd.DataFrame(data, columns = ['ID','A','B','C','D'])
DF2 = DF1.groupby(['A','B','C','D']).count()
I would appreciate any insight that anyone might have to offer.

I have DataFrame's columns and data in list i want to put the relevant data to relevant column

suppose you have given list of all item you can have and separately you have list of data and whose shape of list is not fixed it may contain any number of item you wished to create a dataframe from it and you have to put it on write column
for example
columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt',]]
# and from this I wants to create a dummy variable like this
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
If want indicator columns filled by 0 and 1 only use MultiLabelBinarizer with DataFrame.reindex if want change ordering of columns by list and if possible some value not exist add only 0 column:
columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt',]]
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = (pd.DataFrame(mlb.fit_transform(data),columns=mlb.classes_)
.reindex(columns, axis=1, fill_value=0))
print (df)
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
Or Series.str.get_dummies:
df = pd.Series(data).str.join('|').str.get_dummies().reindex(columns, axis=1, fill_value=0)
print (df)
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
This is one approach using collections.Counter.
Ex:
from collections import Counter
columns = ['shirt','shoe','tie','hat']
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt']]
data = map(Counter, data)
#df = pd.DataFrame(data, columns=columns)
df = pd.DataFrame(data, columns=columns).fillna(0).astype(int)
print(df)
Output:
shirt shoe tie hat
0 0 0 1 1
1 1 1 1 0
2 1 0 1 0
You can try converting data to a dataframe:
data = [['hat','tie'],
['shoe', 'tie', 'shirt'],
['tie', 'shirt',]]
df = pd.DataFrame(data)
df
0 1 2
0 hat tie None
1 shoe tie shirt
2 tie shirt None
Them use:
pd.get_dummies(df.stack()).groupby(level=0).agg('sum')
hat shirt shoe tie
0 1 0 0 1
1 0 1 1 1
2 0 1 0 1
Explanation:
df.stack() returns a MultiIndex Series:
0 0 hat
1 tie
1 0 shoe
1 tie
2 shirt
2 0 tie
1 shirt
dtype: object
If we get the dummy values of this series we get:
hat shirt shoe tie
0 0 1 0 0 0
1 0 0 0 1
1 0 0 0 1 0
1 0 0 0 1
2 0 1 0 0
2 0 0 0 0 1
1 0 1 0 0
Then you just have to groupby the index and merge them using sum (because we know that there will only be one or zero after get_dummies):
df = pd.get_dummies(df.stack()).groupby(level=0).agg('sum')

Writing Function on Data Frame in Pandas

I have data in excel which have two columns 'Peak Value' & 'Label'. I want to add value in 'Label' column based on 'Peak Value' column.
So, Input looks like below
Peak Value 0 0 0 88 0 0 88 0 0 88 0
Label 0 0 0 0 0 0 0 0 0 0 0
Input
Whenever the value in 'Peak Value' is greater than zero then it add 1 in 'Label' and replace all the zeros below it. For the next value greater than zero it should get incremented to 2 and replace all the zeros by 2.
So, the output will look like this:
Peak Value 0 0 0 88 0 0 88 0 0 88 0
Label 0 0 0 1 1 1 2 2 2 3 3
Output
and so on....
I tried writing function but I am only able to add 1 when the value is greater than 0 in 'Peak Value'.
def funct(row):
if row['Peak Value']>0:
val = 1
else:
val = 0
return val
df['Label']= df.apply(funct, axis=1)
May be you could try using cumsum and ffill:
import numpy as np
df['Labels'] = (df['Peak Value'] > 0).groupby(df['Peak Value']).cumsum()
df['Labels'] = df['Labels'].replace(0, np.nan).ffill().replace(np.nan, 0).astype(int)
Output:
Peak Value Labels
0 0 0
1 0 0
2 0 0
3 88 1
4 0 1
5 0 1
6 88 2
7 0 2
8 0 2
9 88 3
10 0 3

Comparing two different sized pandas Dataframes and to find the row index with equal values

I need some help with comparing two pandas dataframe
I have two dataframes
The first dataframe is
df1 =
a b c d
0 1 1 1 1
1 0 1 0 1
2 0 0 0 1
3 1 1 1 1
4 1 0 1 0
5 1 1 1 0
6 0 0 1 0
7 0 1 0 1
and the second dataframe is
df2 =
a b c d
0 1 1 1 1
1 1 0 1 0
2 0 0 1 0
I want to find the row index of dataframe 1 (df1) which the entire row is the same as the rows in dataframe 2 (df2). My expect result would be
0
3
4
6
The order of the above index does not need to be in order, all I want is the index of dataframe 1 (df1)
Is there a way without using for loop?
Thanks
Tommy
You can using merge
df1.merge(df2,indicator=True,how='left').loc[lambda x : x['_merge']=='both'].index
Out[459]: Int64Index([0, 3, 4, 6], dtype='int64')

Resources