find out percentage of duplicates - python-3.x

I have the following data:
   id        date    A  Area  Price  Hol
0   1  2019-01-01   No    80    200   No
1   2  2019-01-02  Yes   100    300  Yes
2   3  2019-01-03  Yes   100    300  Yes
3   4  2019-01-04   No    50    100   No
4   5  2019-01-05   No    20     50   No
5   1  2019-01-01   No    80    200   No
I want to find out duplicates (for the same id).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 1], 'date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
'2019-01-05', '2019-01-01'],
'A': ['No', 'Yes', 'Yes', 'No', 'No', 'No'],
'Area': [80, 100, 100, 50, 20, 80], 'Price': [200, 300, 300, 100, 50, 200],
'Hol': ['No', 'Yes', 'Yes', 'No', 'No', 'No']})
df['date'] = pd.to_datetime(df['date'])
fig, ax = plt.subplots(figsize=(15, 7))
df.groupby(['A', 'Area', 'Price', 'Hol'])['id'].value_counts().plot(ax=ax)
I can see that I have one duplicate (for id 1, all the entries are the same).
Now, I want to find out what percentage those duplicates represent in the whole dataset.
I can't find a way to express this, since I am already using value_counts() in order to find the duplicates and I can't do something like:
df.groupby(['A', 'Area', 'Price', 'Hol'])['id'].value_counts().size()
percentage = (test / test.groupby(level=0).sum()) * 100

I believe you need DataFrame.duplicated with Series.value_counts:
percentage = df.duplicated(keep=False).value_counts(normalize=True) * 100
print (percentage)
False 66.666667
True 33.333333
dtype: float64

Is duplicated what you need ?
df.duplicated(keep=False).mean()
Out[107]: 0.3333333333333333
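Both answers above use keep=False, which flags every row that belongs to a duplicated group, so the original row and its copy are both counted. A minimal sketch on the question's data, comparing that with the default keep='first', which counts only the extra copies (the variable names are illustrative, not from the original posts):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 1],
    'date': ['2019-01-01', '2019-01-02', '2019-01-03',
             '2019-01-04', '2019-01-05', '2019-01-01'],
    'A': ['No', 'Yes', 'Yes', 'No', 'No', 'No'],
    'Area': [80, 100, 100, 50, 20, 80],
    'Price': [200, 300, 300, 100, 50, 200],
    'Hol': ['No', 'Yes', 'Yes', 'No', 'No', 'No'],
})
df['date'] = pd.to_datetime(df['date'])

# keep=False marks every member of a duplicated group (rows 0 and 5 here).
pct_all_members = df.duplicated(keep=False).mean() * 100

# The default keep='first' marks only the repeated copies (row 5 here).
pct_extra_copies = df.duplicated().mean() * 100

print(round(pct_all_members, 2), round(pct_extra_copies, 2))
```

Which figure is the "percentage of duplicates" depends on whether the first occurrence should count as a duplicate. To compare on id alone rather than on all columns, pass subset='id' to duplicated().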

Related

pandas move to correspondent column based on value of other column

I'm trying to move the f1_am, f2_am and f3_am values to the corresponding column based on the values of f1_ty, f2_ty and f3_ty.
I started by adding new columns to the dataframe based on the unique values from the _ty columns using sets, but I can't figure out how to move the _am values to where they belong.
I looked at groupby and pivot, but the result blew my mind.
I would appreciate some guidance.
Below the code.
import pandas as pd
import numpy as np
data = {
'mem_id': ['A', 'B', 'C', 'A', 'B', 'C']
, 'date_inf': ['01/01/2019', '01/01/2019', '01/01/2019', '02/01/2019', '02/01/2019', '02/01/2019']
, 'f1_ty': ['ABC', 'ABC', 'ABC', 'ABC', 'GHI', 'GHI']
, 'f1_am': [100, 20, 57, 44, 15, 10]
, 'f2_ty': ['DEF', 'DEF', 'DEF', 'GHI', 'ABC', 'XYZ']
, 'f2_am':[20, 30, 45, 66, 14, 21]
, 'f3_ty': ['XYZ', 'GHI', 'OPQ', 'OPQ', 'XYZ', 'DEF']
, 'f3_am':[20, 30, 45, 66, 14, 21]
}
df = pd.DataFrame (data)
#distinct values in columns using sets
distinct_values = sorted(list(set(df['f1_ty'])|set(df['f2_ty'])|set(df['f3_ty'])))
# add distinct values as new columns in the DataFrame
new_df = df.reindex(columns = np.append( df.columns.values, distinct_values))
So this would be my starting point and my wanted result.
Here is a try, thanks for the interesting problem. Rename the columns to make them compatible with wide_to_long(), then call unstack() while dropping the extra levels:
m=df.set_index(['mem_id','date_inf']).rename(columns=lambda x: ''.join(x.split('_')[::-1]))
n=(pd.wide_to_long(m.reset_index(),['tyf','amf'],['mem_id','date_inf'],'v')
.droplevel(-1).set_index('tyf',append=True).unstack(fill_value=0).reindex(m.index))
final=n.droplevel(0,axis=1).rename_axis(None,axis=1).reset_index()
print(final)
mem_id date_inf ABC DEF GHI OPQ XYZ
0 A 01/01/2019 100 20 0 0 20
1 B 01/01/2019 20 30 30 0 0
2 C 01/01/2019 57 45 0 45 0
3 A 02/01/2019 44 0 66 66 0
4 B 02/01/2019 14 0 15 0 14
5 C 02/01/2019 0 21 10 0 21
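An alternative that may be easier to follow, offered as a sketch and not as the answerer's method: stack the three (type, amount) column pairs into one long table, then pivot on the type. The names long_df and wide, and the choice of aggfunc='sum' for a type that appears twice in one row, are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'mem_id': ['A', 'B', 'C', 'A', 'B', 'C'],
    'date_inf': ['01/01/2019', '01/01/2019', '01/01/2019',
                 '02/01/2019', '02/01/2019', '02/01/2019'],
    'f1_ty': ['ABC', 'ABC', 'ABC', 'ABC', 'GHI', 'GHI'],
    'f1_am': [100, 20, 57, 44, 15, 10],
    'f2_ty': ['DEF', 'DEF', 'DEF', 'GHI', 'ABC', 'XYZ'],
    'f2_am': [20, 30, 45, 66, 14, 21],
    'f3_ty': ['XYZ', 'GHI', 'OPQ', 'OPQ', 'XYZ', 'DEF'],
    'f3_am': [20, 30, 45, 66, 14, 21],
})

# One frame per (fN_ty, fN_am) pair, renamed to common column names.
parts = [
    df[['mem_id', 'date_inf', f'f{i}_ty', f'f{i}_am']]
    .rename(columns={f'f{i}_ty': 'ty', f'f{i}_am': 'am'})
    for i in (1, 2, 3)
]
long_df = pd.concat(parts, ignore_index=True)

# Spread the types into columns; combinations that never occur become 0.
wide = long_df.pivot_table(index=['mem_id', 'date_inf'], columns='ty',
                           values='am', aggfunc='sum', fill_value=0)
final = wide.rename_axis(None, axis=1).reset_index()
print(final)
```

Note that pivot_table sorts the rows by mem_id and date_inf, so the row order differs from the output shown above even though the values are the same.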

Converting a List of Pandas Series to a single Pandas DataFrame

I am using statsmodels.api on my data set. I have a list of pandas Series. Each Series holds key-value pairs: the keys are the names of the columns and the values contain the data. All of the Series in the list have the same keys, so the keys (column names) are repeated across the list. I want to save all of the values from the list to a single DataFrame whose column names are the keys of the Series, so that I can export the DataFrame as a CSV. Any idea how I can use the keys as the column names of the DataFrame and have the values fill in the rows?
Each series in the list returns something like this:
index 0 of the list: <class 'pandas.core.series.Series'>
height 23
weight 10
size 45
amount 9
index 1 of the list: <class 'pandas.core.series.Series'>
height 11
weight 99
size 25
amount 410
index 2 of the list: <class 'pandas.core.series.Series'>
height 3
weight 0
size 115
amount 92
I would like to get a dataframe in which these values are saved as follows:
DataFrame:
height  weight  size  amount
    23      10    45       9
    11      99    25     410
     3       0   115      92
pd.DataFrame(data=your_list_of_series)
When creating a new DataFrame, pandas will accept a list of series for the data argument. The indices of your series will become the column names of the DataFrame.
Not the most efficient way, but this does the trick:
import pandas as pd
series_list =[ pd.Series({ 'height': 23,
'weight': 10,
'size': 45,
'amount': 9
}),
pd.Series({ 'height': 11,
'weight': 99,
'size': 25,
'amount': 410
}),
pd.Series({ 'height': 3,
'weight': 0,
'size': 115,
'amount': 92
})
]
pd.DataFrame( [series.to_dict() for series in series_list] )
Did you try just calling pd.DataFrame() on the list of series? That should just work.
import pandas as pd
series_list = [
pd.Series({
'height': 23,
'weight': 10,
'size': 45,
'amount': 9
}),
pd.Series({
'height': 11,
'weight': 99,
'size': 25,
'amount': 410
}),
pd.Series({
'height': 3,
'weight': 0,
'size': 115,
'amount': 92
})
]
df = pd.DataFrame(series_list)
print(df)
df.to_csv('path/to/save/foo.csv')
Output:
height weight size amount
0 23 10 45 9
1 11 99 25 410
2 3 0 115 92

Pandas: How to build a column based on another column which is indexed by another one?

I have this dataframe presented below. I tried a solution below, but I am not sure if this is a good solution.
import numpy as np  # needed for np.where below
import pandas as pd
def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2 , 3, 3],
                'Region': ['A', 'A', 'C', 'B' , 'A', 'B'],
                'var-A': [2,4,6,4,6,6],
                'var-B': [20, 30, 40 , 50, 10, 20],
                'var-C': [3, 4 , 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns = ['code', 'Region','var-A', 'var-B', 'var-C'])
    return df
if __name__=="__main__":
    df=creatingDataFrame()
    # each np.where term is 1.0 only on rows of its region, so exactly one column contributes per row
    df['var']=np.where(df['Region']=='A',1.0,0.0)*df['var-A']+np.where(df['Region']=='B',1.0,0.0)*df['var-B']+np.where(df['Region']=='C',1.0,0.0)*df['var-C']
I want the variable var to take its value from column 'var-A', 'var-B' or 'var-C', depending on the region given in column 'Region'.
The result must be
df['var']
Out[50]:
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64
You can try with lookup
df.columns=df.columns.str.split('-').str[-1]
df
Out[255]:
code Region A B C
0 1 A 2 20 3
1 2 A 4 30 4
2 3 C 6 40 5
3 2 B 4 50 1
4 3 A 6 10 2
5 3 B 6 20 3
df.lookup(df.index,df.Region)
Out[256]: array([ 2, 4, 5, 50, 6, 20], dtype=int64)
#df['var']=df.lookup(df.index,df.Region)
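DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so the call above fails on current versions. A minimal sketch of an equivalent lookup using pd.factorize and NumPy integer indexing, applied to the question's data (it assumes the value columns keep their original 'var-' prefix):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'code': [1, 2, 3, 2, 3, 3],
    'Region': ['A', 'A', 'C', 'B', 'A', 'B'],
    'var-A': [2, 4, 6, 4, 6, 6],
    'var-B': [20, 30, 40, 50, 10, 20],
    'var-C': [3, 4, 5, 1, 2, 3],
})

# idx holds one integer code per row; regions holds the distinct labels
# in order of first appearance.
idx, regions = pd.factorize(df['Region'])

# Arrange the value columns in the same order as `regions`, so that
# column position j corresponds to region code j.
values = df[['var-' + r for r in regions]].to_numpy()

# For each row i, pick the column that matches its region code.
df['var'] = values[np.arange(len(df)), idx]
print(df['var'].tolist())
```

The result is an integer column with the same values as the desired output in the question, which shows floats only because the original np.where expression multiplies by 1.0.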

Split out if > value, divide, add value to column - Python/Pandas

import pandas as pd
df = pd.DataFrame([['Dog', 10, 6], ['Cat', 7 ,5]], columns=('Name','Amount','Day'))
Name Amount Day
Dog 10 6
Cat 7 5
I would like to make the DataFrame look like the following:
Name Amount Day
Dog1 6 6
Dog2 2.5 7
Dog3 1.5 8
Cat 7 5
First step: For any Amount > 8, split into 3 different rows, with new name of 'Name1', 'Name2','Name3'
Second step:
For Dog1, 60% of Amount, Day = Day.
For Dog2, 25% of Amount, Day = Day + 1.
For Dog3, 15% of Amount, Day = Day + 2.
Keep Cat the same because Cat Amount < 8
Any ideas? Any help would be appreciated.
import pandas as pd
df = pd.DataFrame([['Dog', 10, 6], ['Cat', 7 ,5]], columns=('Name','Amount','Day'))
# one template row per split: name suffix, share of Amount, offset added to Day
template = pd.DataFrame([
    ['1', .6, 0],
    ['2', .25, 1],
    ['3', .15, 2]
], columns=df.columns)
def apply_template(r, t):
    t = t.copy()
    t['Name'] = t['Name'].radd(r['Name'])  # 'Dog' + '1' -> 'Dog1'
    t['Amount'] *= r['Amount']
    t['Day'] += r['Day']
    return t
# DataFrame.append was removed in pandas 2.0, so the untouched rows are added to the concat list instead
pd.concat([apply_template(r, template) for _, r in df.query('Amount > 8').iterrows()]
          + [df.query('Amount <= 8')], ignore_index=True)

Getting count of rows using groupby in Pandas

I have two columns in my dataset, col1 and col2. I want to display data grouped by col1.
For that I have written code like:
grouped = df[['col1','col2']].groupby(['col1'], as_index= False)
The above code creates the groupby object.
How do I use the object to display the data grouped as per col1?
To get the counts by group, you can use dataframe.groupby('column').size().
Example:
In [10]:df = pd.DataFrame({'id' : [123,512,'zhub1', 12354.3, 129, 753, 295, 610],
'colour': ['black', 'white','white','white',
'black', 'black', 'white', 'white'],
'shape': ['round', 'triangular', 'triangular','triangular','square',
'triangular','round','triangular']
}, columns= ['id','colour', 'shape'])
In [11]:df
Out[11]:
id colour shape
0 123 black round
1 512 white triangular
2 zhub1 white triangular
3 12354.3 white triangular
4 129 black square
5 753 black triangular
6 295 white round
7 610 white triangular
In [12]:df.groupby('colour').size()
Out[12]:
colour
black 3
white 5
dtype: int64
In [13]:df.groupby('shape').size()
Out[13]:
shape
round 2
square 1
triangular 5
dtype: int64
Try the groups attribute and the get_group() method of the object returned by groupby():
>>> import numpy as np
>>> import pandas as pd
>>> anarray=np.array([[0, 31], [1, 26], [0, 35], [1, 22], [0, 41]])
>>> df = pd.DataFrame(anarray, columns=['is_female', 'age'])
>>> by_gender=df[['is_female','age']].groupby(['is_female'])
>>> by_gender.groups # returns indexes of records
{0: [0, 2, 4], 1: [1, 3]}
>>> by_gender.get_group(0)['age'] # age of males
0 31
2 35
4 41
Name: age, dtype: int64
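To display every group in turn, a GroupBy object can also be iterated directly, yielding one (key, sub-DataFrame) pair per group. A minimal sketch, assuming a small made-up dataset with the col1 and col2 columns from the question:

```python
import pandas as pd

# Hypothetical data; the question does not show the actual values.
df = pd.DataFrame({'col1': ['a', 'b', 'a', 'b', 'a'],
                   'col2': [1, 2, 3, 4, 5]})

grouped = df.groupby('col1')

# Number of rows in each group.
sizes = grouped.size()

# Iterating yields (group key, sub-DataFrame) pairs.
groups = {key: group['col2'].tolist() for key, group in grouped}
for key, values in groups.items():
    print(key, values)
```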
