Getting count of rows using groupby in Pandas

I have two columns in my dataset, col1 and col2. I want to display data grouped by col1.
For that I have written code like:
grouped = df[['col1', 'col2']].groupby(['col1'], as_index=False)
The above code creates the groupby object.
How do I use the object to display the data grouped as per col1?

To get the counts by group, you can use dataframe.groupby('column').size().
Example:
In [9]: import pandas as pd
In [10]: df = pd.DataFrame({'id': [123, 512, 'zhub1', 12354.3, 129, 753, 295, 610],
    ...:                    'colour': ['black', 'white', 'white', 'white',
    ...:                               'black', 'black', 'white', 'white'],
    ...:                    'shape': ['round', 'triangular', 'triangular', 'triangular',
    ...:                              'square', 'triangular', 'round', 'triangular']},
    ...:                   columns=['id', 'colour', 'shape'])
In [11]: df
Out[11]:
        id colour       shape
0      123  black       round
1      512  white  triangular
2    zhub1  white  triangular
3  12354.3  white  triangular
4      129  black      square
5      753  black  triangular
6      295  white       round
7      610  white  triangular
In [12]: df.groupby('colour').size()
Out[12]:
colour
black    3
white    5
dtype: int64
In [13]: df.groupby('shape').size()
Out[13]:
shape
round         2
square        1
triangular    5
dtype: int64
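Since your own code passes as_index=False, note that in newer pandas versions (roughly 1.1 onward) the same pattern then returns the counts as a regular DataFrame with a size column instead of a Series:
In [14]: df.groupby('colour', as_index=False).size()
Out[14]:
  colour  size
0  black     3
1  white     5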

Try the groups attribute and the get_group() method of the object returned by groupby():
>>> import numpy as np
>>> import pandas as pd
>>> anarray = np.array([[0, 31], [1, 26], [0, 35], [1, 22], [0, 41]])
>>> df = pd.DataFrame(anarray, columns=['is_female', 'age'])
>>> by_gender = df[['is_female', 'age']].groupby(['is_female'])
>>> by_gender.groups # returns indexes of records
{0: [0, 2, 4], 1: [1, 3]}
>>> by_gender.get_group(0)['age'] # age of males
0 31
2 35
4 41
Name: age, dtype: int64
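You can also iterate over the groupby object directly to display each group in turn; every iteration yields a (key, sub-DataFrame) pair. A minimal sketch with the same df:
>>> for gender, group in by_gender:
...     print(gender)
...     print(group)
...
0
   is_female  age
0          0   31
2          0   35
4          0   41
1
   is_female  age
1          1   26
3          1   22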

Related

Edit multiple values with df.at()

Why does
>>> offset = 2
>>> data = {'Value': [7, 9, 21, 22, 23, 100]}
>>> df = pd.DataFrame(data=data)
>>> df.at[:offset, "Value"] = 99
>>> df
   Value
0     99
1     99
2     99
3     22
4     23
5    100
change values at indices [0, 1, 2]? I would expect only [0, 1] to change, consistent with regular slicing.
Like when I do
>>> arr = [0, 1, 2, 3, 4]
>>> arr[0:2]
[0, 1]
.at behaves like .loc, in that it selects rows/columns by label, and label slicing in pandas is inclusive of the endpoint. Note that .iloc, which slices on integer positions, behaves like you would expect.
Also note that the pandas documentation suggests using .at only when selecting or setting single values; for slices, use .loc instead.
On the fourth line of your snippet, :offset (i.e. :2) is a label slice, so it means all rows with labels 0 through 2 inclusive. If you want to change only the third row, change it to 2:2.
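To see the inclusive/exclusive difference directly, rebuild the frame and compare label-based .loc with position-based .iloc:
>>> df = pd.DataFrame({'Value': [7, 9, 21, 22, 23, 100]})
>>> df.loc[0:2, 'Value']   # label slice: endpoint 2 is included
0     7
1     9
2    21
Name: Value, dtype: int64
>>> df.iloc[0:2]['Value']  # positional slice: endpoint excluded
0    7
1    9
Name: Value, dtype: int64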

Create Plotly plot for each column in df with a 'for' loop

I want to make multiple Plotly scatter plots (one for each column) from a df using a for loop in Python. I also want to be able to show each plot by entering its column name.
See sample code:
import plotly.express as px
import pandas as pd
df = pd.DataFrame({'A': [3, 1, 2, 3],
                   'B': [5, 6, 7, 8],
                   'C': [2, 1, 6, 3]})
df
   A  B  C
0  3  5  2
1  1  6  1
2  2  7  6
3  3  8  3
The closest I got is this:
for i in df.columns:
    i = px.scatter(df,
                   x="A",
                   y=i)
But this fails to keep a reference to each plot, because the loop variable i is overwritten on every iteration. I want to be able to show the plot for column A by entering A, the plot for B by entering B, etc.
You can add the plots to a dict like
plots = {i: px.scatter(df, x="A", y=i) for i in df.columns}
which is equivalent to, but shorter than,
plots = {}
for i in df.columns:
    plots[i] = px.scatter(df, x="A", y=i)
and then you can show each plot with plots['A'], plots['B'], and plots['C'].
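Note that evaluating plots['A'] on its own line only renders the figure automatically in a notebook; in a plain script you would call the figure's show() method explicitly:
plots['A'].show()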

find out percentage of duplicates

I have the following data:
   id       date    A  Area  Price  Hol
0   1 2019-01-01   No    80    200   No
1   2 2019-01-02  Yes   100    300  Yes
2   3 2019-01-03  Yes   100    300  Yes
3   4 2019-01-04   No    50    100   No
4   5 2019-01-05   No    20     50   No
5   1 2019-01-01   No    80    200   No
I want to find out duplicates (for the same id).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 1],
                   'date': ['2019-01-01', '2019-01-02', '2019-01-03',
                            '2019-01-04', '2019-01-05', '2019-01-01'],
                   'A': ['No', 'Yes', 'Yes', 'No', 'No', 'No'],
                   'Area': [80, 100, 100, 50, 20, 80],
                   'Price': [200, 300, 300, 100, 50, 200],
                   'Hol': ['No', 'Yes', 'Yes', 'No', 'No', 'No']})
df['date'] = pd.to_datetime(df['date'])
fig, ax = plt.subplots(figsize=(15, 7))
df.groupby(['A', 'Area', 'Price', 'Hol'])['id'].value_counts().plot(ax=ax)
I can see that I have one duplicate (for id 1, all the entries are the same).
Now, I want to find out what percentage those duplicates represent in the whole dataset.
I can't find a way to express this, since I am already using value_counts() in order to find the duplicates and I can't do something like:
df.groupby(['A', 'Area', 'Price', 'Hol'])['id'].value_counts().size()
percentage = (test / test.groupby(level=0).sum()) * 100
I believe you need DataFrame.duplicated with Series.value_counts:
percentage = df.duplicated(keep=False).value_counts(normalize=True) * 100
print(percentage)
False    66.666667
True     33.333333
dtype: float64
Is duplicated what you need?
df.duplicated(keep=False).mean()
Out[107]: 0.3333333333333333
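Both snippets treat a row as a duplicate only when every column repeats. Since you said you care about duplicates for the same id, note that duplicated() also accepts a subset parameter; a sketch restricted to the id column:
df.duplicated(subset=['id'], keep=False).mean()
Here this gives the same 2/6 ≈ 0.33, because the two id 1 rows are the only repeats.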

Converting a List of Pandas Series to a single Pandas DataFrame

I am using statsmodels.api on my data set and have a list of pandas Series. Each Series holds key-value pairs: the keys are the column names, the values contain the data, and the same keys are repeated across every Series in the list. I want to save all of the values from the list into a single DataFrame, with the Series keys as the column names, so that I can export the DataFrame as a CSV. How can I make the keys the column names of the df and have the values fill in the rows?
Each series in the list returns something like this:
index 0 of the list: <class 'pandas.core.series.Series'>
height 23
weight 10
size 45
amount 9
index 1 of the list: <class 'pandas.core.series.Series'>
height 11
weight 99
size 25
amount 410
index 2 of the list: <class 'pandas.core.series.Series'>
height 3
weight 0
size 115
amount 92
I would like to be able to build a DataFrame in which these values are saved as follows:
DataFrame:
height  weight  size  amount
    23      10    45       9
    11      99    25     410
     3       0   115      92
pd.DataFrame(data=your_list_of_series)
When creating a new DataFrame, pandas will accept a list of series for the data argument. The indices of your series will become the column names of the DataFrame.
Not the most efficient way, but this does the trick:
import pandas as pd

series_list = [pd.Series({'height': 23,
                          'weight': 10,
                          'size': 45,
                          'amount': 9}),
               pd.Series({'height': 11,
                          'weight': 99,
                          'size': 25,
                          'amount': 410}),
               pd.Series({'height': 3,
                          'weight': 0,
                          'size': 115,
                          'amount': 92})]

pd.DataFrame([series.to_dict() for series in series_list])
Did you try just calling pd.DataFrame() on the list of series? That should just work.
import pandas as pd

series_list = [
    pd.Series({
        'height': 23,
        'weight': 10,
        'size': 45,
        'amount': 9
    }),
    pd.Series({
        'height': 11,
        'weight': 99,
        'size': 25,
        'amount': 410
    }),
    pd.Series({
        'height': 3,
        'weight': 0,
        'size': 115,
        'amount': 92
    })
]

df = pd.DataFrame(series_list)
print(df)
df.to_csv('path/to/save/foo.csv')
Output:
   height  weight  size  amount
0      23      10    45       9
1      11      99    25     410
2       3       0   115      92
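As a side note, if you don't want the 0, 1, 2 row index written to the file, to_csv accepts index=False:
df.to_csv('path/to/save/foo.csv', index=False)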

Pandas: How to build a column based on another column which is indexed by another one?

I have the dataframe built below. I tried the solution shown, but I am not sure it is a good one.
import numpy as np
import pandas as pd

def creatingDataFrame():
    raw_data = {'code': [1, 2, 3, 2, 3, 3],
                'Region': ['A', 'A', 'C', 'B', 'A', 'B'],
                'var-A': [2, 4, 6, 4, 6, 6],
                'var-B': [20, 30, 40, 50, 10, 20],
                'var-C': [3, 4, 5, 1, 2, 3]}
    df = pd.DataFrame(raw_data, columns=['code', 'Region', 'var-A', 'var-B', 'var-C'])
    return df

if __name__ == "__main__":
    df = creatingDataFrame()
    df['var'] = (np.where(df['Region'] == 'A', 1.0, 0.0) * df['var-A']
                 + np.where(df['Region'] == 'B', 1.0, 0.0) * df['var-B']
                 + np.where(df['Region'] == 'C', 1.0, 0.0) * df['var-C'])
I want the variable var to take the value of column 'var-A', 'var-B', or 'var-C' depending on the region given in column 'Region'.
The result must be
df['var']
Out[50]:
0 2.0
1 4.0
2 5.0
3 50.0
4 6.0
5 20.0
Name: var, dtype: float64
You can try it with lookup:
df.columns = df.columns.str.split('-').str[-1]
df
Out[255]:
   code Region  A   B  C
0     1      A  2  20  3
1     2      A  4  30  4
2     3      C  6  40  5
3     2      B  4  50  1
4     3      A  6  10  2
5     3      B  6  20  3
df.lookup(df.index, df.Region)
Out[256]: array([ 2,  4,  5, 50,  6, 20], dtype=int64)
# df['var'] = df.lookup(df.index, df.Region)
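Be aware that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On newer versions you can get the same result with NumPy indexing; a minimal sketch, assuming the columns have already been renamed as above:
import numpy as np
rows = np.arange(len(df))
cols = df.columns.get_indexer(df['Region'])  # column position of each row's region
df['var'] = df.to_numpy()[rows, cols]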
