python - Count elements in a pandas dataframe - python-3.x

I have a table with some info about districts. I have converted it into a pandas dataframe and my question is how can I count how many times SOUTHERN, BAYVIEW etc. appear in the table below? I want to add an extra column next to District with the total number of each district.
District
0 SOUTHERN
1 BAYVIEW
2 CENTRAL
3 NORTH

You need a groupby followed by the size method (other aggregations such as count also work).
With this dataframe:
import pandas as pd
df = pd.DataFrame({'DISTRICT': ['SOUTHERN', 'SOUTHERN', 'BAYVIEW', 'BAYVIEW', 'BAYVIEW', 'CENTRAL', 'NORTH']})
which looks like this:
DISTRICT
0 SOUTHERN
1 SOUTHERN
2 BAYVIEW
3 BAYVIEW
4 BAYVIEW
5 CENTRAL
6 NORTH
You can use
df.groupby(['DISTRICT']).size().reset_index(name='counts')
which gives this output:
DISTRICT counts
0 BAYVIEW 3
1 CENTRAL 1
2 NORTH 1
3 SOUTHERN 2
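The question asks for the total as an extra column next to each row rather than as a summary table, so a groupby with transform may be closer to what was wanted. A minimal sketch on the same example dataframe (the column name counts is only an illustrative choice):

```python
import pandas as pd

df = pd.DataFrame({'DISTRICT': ['SOUTHERN', 'SOUTHERN', 'BAYVIEW', 'BAYVIEW',
                                'BAYVIEW', 'CENTRAL', 'NORTH']})

# transform('size') broadcasts each group's size back onto its own rows,
# so the result has the same length and index as df
df['counts'] = df.groupby('DISTRICT')['DISTRICT'].transform('size')
print(df)
```

Unlike the reset_index version, this keeps one row per original record, which is useful if other columns need to stay attached.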

Related

Perform unique row operation after a groupby

I am stuck on a problem: I have done all the groupby operations and got the resulting dataframe shown below, but I am struggling with the last step, which is calculating one additional column.
Current dataframe:
code industry category count duration
2 Retail Mobile 4 7
3 Retail Tab 2 33
3 Health Mobile 5 103
2 Food TV 1 88
The question: I want an additional column, operation, that holds the ratio of the count for industry 'Retail' to the total count for that row's code.
For example, code 2 has two industry entries, Retail and Food, so the operation column should have the value 4/(4+1) = 0.8, and similarly for code 3, as shown below.
O/P:
code industry category count duration operation
2 Retail Mobile 4 7 0.8
3 Retail Tab 2 33 -
3 Health Mobile 5 103 2/7 = 0.285
2 Food TV 1 88 -
I would also appreciate help with this: if I just use groupby, I lose the category and duration information. What would be a better way to represent the output df? There can be multiple industries, and the operation is limited to Retail.
I can't think of a single operation, but going via a dictionary should work. First, for the benefit of other answerers, here is the code to create the example dataframe:
import pandas as pd

st_l = [[2, 'Retail', 'Mobile', 4, 7],
        [3, 'Retail', 'Tab', 2, 33],
        [3, 'Health', 'Mobile', 5, 103],
        [2, 'Food', 'TV', 1, 88]]
df = pd.DataFrame(st_l, columns=['code', 'industry', 'category', 'count', 'duration'])
And now my attempt (note that it computes the ratio for every row, not only for the Retail rows):
sums = df[['code', 'count']].groupby('code').sum().to_dict()['count']
df['operation'] = df.apply(lambda x: x['count']/sums[x['code']], axis=1)
You can create a new column with the total count of each code using groupby.transform(), then use loc to select only the rows whose industry is 'Retail' and perform the division there:
df['total_per_code'] = df.groupby(['code'])['count'].transform('sum')
df.loc[df.industry.eq('Retail'), 'operation'] = df['count'].div(df.total_per_code)
df.drop('total_per_code',axis=1,inplace=True)
prints back:
code industry category count duration operation
0 2 Retail Mobile 4 7 0.800000
1 3 Retail Tab 2 33 0.285714
2 3 Health Mobile 5 103 NaN
3 2 Food TV 1 88 NaN
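The same result can probably be reached without the temporary column by combining transform with Series.where. This is only a sketch on the example data from the question, and the intermediate name share is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    [[2, 'Retail', 'Mobile', 4, 7],
     [3, 'Retail', 'Tab', 2, 33],
     [3, 'Health', 'Mobile', 5, 103],
     [2, 'Food', 'TV', 1, 88]],
    columns=['code', 'industry', 'category', 'count', 'duration'])

# share of each row's count within its code group
share = df['count'] / df.groupby('code')['count'].transform('sum')

# keep the share only on Retail rows; every other row becomes NaN
df['operation'] = share.where(df['industry'].eq('Retail'))
print(df)
```

Because nothing is grouped away, the category and duration columns stay in place.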

Extract the mapping dictionary between two columns in pandas

I have a dataframe as shown below.
df:
id player country_code country
1 messi arg argentina
2 neymar bra brazil
3 tevez arg argentina
4 aguero arg argentina
5 rivaldo bra brazil
6 owen eng england
7 lampard eng england
8 gerrard eng england
9 ronaldo bra brazil
10 marria arg argentina
from the above df, I would like to extract the mapping dictionary that relates the country_code with country columns.
Expected Output:
d = {'arg':'argentina', 'bra':'brazil', 'eng':'england'}
A dictionary has unique keys, so you can set country_code as the index (duplicates in the index are fine) and convert the resulting Series:
d = df.set_index('country_code')['country'].to_dict()
If a country_code can map to more than one country, the last country seen for that code is kept.
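If you want to check for such conflicting codes before building the dictionary, one option is to count the distinct countries per code. A sketch on a shortened copy of the example data (the names n_countries, conflicts and pairs are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'player': ['messi', 'neymar', 'tevez', 'owen'],
    'country_code': ['arg', 'bra', 'arg', 'eng'],
    'country': ['argentina', 'brazil', 'argentina', 'england'],
})

# number of distinct countries seen for each code; anything above 1 is a conflict
n_countries = df.groupby('country_code')['country'].nunique()
conflicts = n_countries[n_countries > 1]

# drop repeated (code, country) pairs, then build the mapping
pairs = df[['country_code', 'country']].drop_duplicates()
d = dict(zip(pairs['country_code'], pairs['country']))
print(d)
```

An empty conflicts Series means every code maps to exactly one country, so the dictionary loses no information.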

How to access a column of grouped data to perform linear regression in pandas?

I want to perform a linear regression on the groups of a grouped data frame in pandas. The function I am calling throws a KeyError that I cannot resolve.
I have an environmental data set called dat that includes concentration data of a chemical in different tree species of various age classes in different country sites over the course of several time steps. I now want to do a regression of concentration over time steps within each group of (site, species, age).
This is my code:
```
import pandas as pd
import statsmodels.api as sm
dat = pd.read_csv('data.csv')
dat.head(15)
SampleName Concentration Site Species Age Time_steps
0 batch1 2.18 Germany pine 1 1
1 batch2 5.19 Germany pine 1 2
2 batch3 11.52 Germany pine 1 3
3 batch4 16.64 Norway spruce 0 1
4 batch5 25.30 Norway spruce 0 2
5 batch6 31.20 Norway spruce 0 3
6 batch7 12.63 Norway spruce 1 1
7 batch8 18.70 Norway spruce 1 2
8 batch9 43.91 Norway spruce 1 3
9 batch10 9.41 Sweden birch 0 1
10 batch11 11.10 Sweden birch 0 2
11 batch12 15.73 Sweden birch 0 3
12 batch13 16.87 Switzerland beech 0 1
13 batch14 22.64 Switzerland beech 0 2
14 batch15 29.75 Switzerland beech 0 3
def ols_res_grouped(group):
    xcols_const = sm.add_constant(group['Time_steps'])
    linmod = sm.OLS(group['Concentration'], xcols_const).fit()
    return linmod.params[1]

grouped = dat.groupby(['Site','Species','Age']).agg(ols_res_grouped)
```
I want to get the regression coefficient of concentration data over Time_steps but get a KeyError: 'Time_steps'. How can the sm method access group["Time_steps"]?
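According to the pandas documentation, agg applies the function to each column independently. ols_res_grouped therefore receives a single column as a Series rather than the whole group, so group['Time_steps'] raises the KeyError.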
It might be possible to use NamedAgg but I am not sure.
I think it is a lot easier to just use a for loop for this:
for _, group in dat.groupby(['Site','Species','Age']):
    coeff = ols_res_grouped(group)
    # if you want to put the coeff inside the dataframe
    dat.loc[group.index, 'coeff'] = coeff
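Another option that may work is groupby(...).apply, which passes each group to the function as a whole DataFrame, so both columns are reachable. The sketch below uses the first six rows of the question's data. As an assumption to keep it dependency-light, it swaps sm.OLS for numpy.polyfit, which gives the same slope for a simple linear fit; the function name slope is made up for illustration.

```python
import numpy as np
import pandas as pd

dat = pd.DataFrame({
    'Site': ['Germany'] * 3 + ['Norway'] * 3,
    'Species': ['pine'] * 3 + ['spruce'] * 3,
    'Age': [1, 1, 1, 0, 0, 0],
    'Time_steps': [1, 2, 3, 1, 2, 3],
    'Concentration': [2.18, 5.19, 11.52, 16.64, 25.30, 31.20],
})

def slope(group):
    # group is a DataFrame holding both columns for one (Site, Species, Age);
    # polyfit with degree 1 returns [slope, intercept]
    # (stand-in for sm.OLS, not the original statsmodels call)
    return np.polyfit(group['Time_steps'], group['Concentration'], 1)[0]

# select the two needed columns explicitly, then apply once per group
coeffs = (dat.groupby(['Site', 'Species', 'Age'])[['Time_steps', 'Concentration']]
             .apply(slope))
print(coeffs)
```

The result is a Series indexed by (Site, Species, Age) with one coefficient per group. The original ols_res_grouped could be passed to apply in the same way.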

Sorting and Grouping in Pandas data frame column alphabetically

I want to sort and group by a pandas data frame column alphabetically.
a b c
0 sales 2 NaN
1 purchase 130 230.0
2 purchase 10 20.0
3 sales 122 245.0
4 purchase 103 320.0
I want to sort column "a" such that it is in alphabetical order and is grouped as well i.e., the output is as follows:
          a    b      c
1  purchase  130  230.0
2             10   20.0
4            103  320.0
0     sales    2    NaN
3            122  245.0
How can I do this?
I think you should use the sort_values method of pandas:
result = dataframe.sort_values('a')
This sorts your dataframe by column a, and rows with the same value end up next to each other as a side effect of the sorting.
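To reproduce the layout in the question more closely (original row order kept inside each block, each value of a shown only once), a stable sort plus a MultiIndex might do. A sketch on the question's data, where grouped_view is just an illustrative name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': ['sales', 'purchase', 'purchase', 'sales', 'purchase'],
                   'b': [2, 130, 10, 122, 103],
                   'c': [np.nan, 230.0, 20.0, 245.0, 320.0]})

# mergesort is a stable algorithm, so rows keep their original
# relative order inside each value of 'a'
result = df.sort_values('a', kind='mergesort')

# moving 'a' into the index as the outer level makes pandas
# print each value only once per block
grouped_view = result.set_index('a', append=True).swaplevel(0, 1)
print(grouped_view)
```

The blank repeated labels are only a display feature of the MultiIndex; the underlying values are all still there.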

How to group by two Columns using Pandas?

I am working on an algorithm, which requires grouping by two columns. Pandas supports grouping by two columns by using:
df.groupby([col1, col2])
But the resulting dataframe is not the one I need.
Work Setup:
Python : v3.5
Pandas : v0.18.1
Pandas Dataframe - Input Data:
        Type  Segment
id
1   Domestic        1
2     Salary        3
3        NRI        1
4     Salary        4
5     Salary        3
6        NRI        4
7     Salary        4
8     Salary        3
9     Salary        4
10       NRI        4
Required Dataframe:
Count of [Domestic, Salary, NRI] in each Segment:
         Domestic  Salary  NRI
Segment
1               1       0    1
3               0       3    0
4               0       3    2
Experiments:
group = df.groupby(['Segment', 'Type'])
group.size()
Segment  Type      Count
1        Domestic  1
         NRI       1
3        Salary    3
4        Salary    3
         NRI       2
I am able to achieve the required dataframe using MS Excel Pivot Table feature. Is there any way, where I can achieve similar results using pandas?
Groupby.size returns a Series with a two-level MultiIndex. To turn it into the required dataframe, unstack the second index level and, optionally, fill the resulting NaNs with 0:
df.groupby(['Segment', 'Type']).size().unstack(level=1, fill_value=0)
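Since the question mentions Excel pivot tables, pd.crosstab is probably the closest built-in equivalent. A sketch that rebuilds the input data from the question and, as an extra assumed step, reorders the columns to the order requested there:

```python
import pandas as pd

df = pd.DataFrame(
    {'Type': ['Domestic', 'Salary', 'NRI', 'Salary', 'Salary',
              'NRI', 'Salary', 'Salary', 'Salary', 'NRI'],
     'Segment': [1, 3, 1, 4, 3, 4, 4, 3, 4, 4]},
    index=pd.Index(range(1, 11), name='id'))

# crosstab counts how often each (Segment, Type) pair occurs;
# pairs that never occur get 0
table = pd.crosstab(df['Segment'], df['Type'])

# columns come back sorted alphabetically; reorder to the requested layout
table = table[['Domestic', 'Salary', 'NRI']]
print(table)
```

The groupby/size/unstack version returns its columns in alphabetical order too, so the same column selection applies there.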
