I have been stuck on a problem: I have done all the groupby operations and got the resultant dataframe shown below, but the trouble comes in the last step, calculating one additional column.
Current dataframe:
code industry category count duration
2 Retail Mobile 4 7
3 Retail Tab 2 33
3 Health Mobile 5 103
2 Food TV 1 88
The question: I want an additional 'operation' column which calculates the ratio of the 'Retail' count to the total count for each code.
For example: code 2 has two industry entries, Retail and Food, so the operation column should have the value 4/(4+1) = 0.8, and similarly for code 3, as shown below.
O/P:
code industry category count duration operation
2 Retail Mobile 4 7 0.8
3 Retail Tab 2 33 2/7 = 0.285
3 Health Mobile 5 103 -
2 Food TV 1 88 -
Help on this as well: if I just do a groupby, I will lose the category and duration information. Also, what would be a better way to represent the output df? There can be multiple industries, and the operation is limited to just Retail.
I can't think of a single operation that does this, but going via a dictionary should work. And, for the benefit of the other answerers, here is the code to create the example dataframe:
import pandas as pd

st_l = [[2, 'Retail', 'Mobile', 4, 7],
        [3, 'Retail', 'Tab', 2, 33],
        [3, 'Health', 'Mobile', 5, 103],
        [2, 'Food', 'TV', 1, 88]]
df = pd.DataFrame(st_l, columns=['code', 'industry', 'category', 'count', 'duration'])
And now my attempt:
# total count per code, as a plain dict {code: total}
sums = df[['code', 'count']].groupby('code').sum().to_dict()['count']
# divide each row's count by its code's total count
df['operation'] = df.apply(lambda x: x['count'] / sums[x['code']], axis=1)
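Note that this fills operation for every row, not only the Retail ones; to get the dashes from the expected output, a small mask afterwards would do it (a sketch):

# blank out the non-Retail rows to match the requested output
df.loc[df['industry'] != 'Retail', 'operation'] = float('nan')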
You can create a new column with the total count of each code using groupby.transform(), and then use loc to select only the rows whose industry is 'Retail' and perform your division:
df['total_per_code'] = df.groupby(['code'])['count'].transform('sum')
df.loc[df.industry.eq('Retail'), 'operation'] = df['count'].div(df.total_per_code)
df.drop('total_per_code',axis=1,inplace=True)
prints back:
code industry category count duration operation
0 2 Retail Mobile 4 7 0.800000
1 3 Retail Tab 2 33 0.285714
2 3 Health Mobile 5 103 NaN
3 2 Food TV 1 88 NaN
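The temporary column can also be avoided by dividing against the transform result directly; a sketch on the same sample frame:

# assign only to the Retail rows; pandas aligns the right-hand Series on the index
df.loc[df.industry.eq('Retail'), 'operation'] = (
    df['count'] / df.groupby('code')['count'].transform('sum')
)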
Related
I am currently doing a groupby in pandas like this:
df.groupby(['grade'])['students'].nunique()
and the result I get is this:
grade
grade 1 12
grade 2 8
grade 3 30
grade 4 2
grade 5 600
grade 6 90
Is there a way to get the output such that I see the groups of the top 3, and everything else is classified under other?
this is what I am looking for
grade
grade 3 30
grade 5 600
grade 6 90
other (3 other grades) 22
I think you can add a helper column to the df and call it something like "Grouping":
keep the original name for the top 3 rows, name the remaining rows "other", and then just group by the "Grouping" column, as sketched below.
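A minimal sketch of that idea, assuming the already-aggregated frame with 'grade' and 'unique' columns shown in the next answer:

top3 = df.nlargest(3, 'unique')['grade']
# keep the top-3 names, relabel every other grade as 'other'
df['Grouping'] = df['grade'].where(df['grade'].isin(top3), 'other')
out = df.groupby('Grouping', sort=False)['unique'].sum()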
Can't do much without the actual input data, but if this is your starting dataframe (df) after your groupby -
grade unique
0 grade_1 12
1 grade_2 8
2 grade_3 30
3 grade_4 2
4 grade_5 600
5 grade_6 90
You can do a few more steps to get to your table -
ddf = df.nlargest(3, 'unique')
# collapse everything outside the top 3 into a single 'Other' row
# (df.append was removed in pandas 2.0, so pd.concat is used here)
other = pd.DataFrame([{'grade': 'Other', 'unique': df['unique'].sum() - ddf['unique'].sum()}])
ddf = pd.concat([ddf, other], ignore_index=True)
grade unique
0 grade_5 600
1 grade_6 90
2 grade_3 30
3 Other 22
I have a DataFrame which consists of 3 columns: CustomerId, Amount and Status (success or failed).
The DataFrame is not sorted in any way, and a CustomerId can repeat multiple times.
I want to introduce a new column into this DataFrame with the following logic:
df['totalamount'] = sum of Amount for each customer where Status was Success.
I already have working code, but it uses df.iterrows, which takes too much time; I am therefore looking for alternatives such as pandas or NumPy vectorization.
For example, I want to create the 'totalamount' column from the first three columns:
CustomerID Amount Status totalamount
0 1 5 Success 105 # since both transactions were successful
1 2 10 Failed 80 # since one transaction was successful
2 3 50 Success 50
3 1 100 Success 105
4 2 80 Success 80
5 4 60 Failed 0
Use where to mask the 'Failed' rows with NaN while preserving the length of the DataFrame. Then group by the CustomerID and transform the sum of the 'Amount' column to bring the result back to every row:
df['totalamount'] = (df.where(df['Status'].eq('Success'))
                       .groupby(df['CustomerID'])['Amount']
                       .transform('sum'))
CustomerID Amount Status totalamount
0 1 5 Success 105.0
1 2 10 Failed 80.0
2 3 50 Success 50.0
3 1 100 Success 105.0
4 2 80 Success 80.0
5 4 60 Failed 0.0
The reason for using where (as opposed to subsetting the DataFrame) is because groupby + sum defaults to sum an entirely NaN group to 0, so we don't need anything extra to deal with CustomerID 4, for instance.
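Since where introduces NaNs, the column comes back as float; if you prefer the integers from the example, a cast afterwards is safe here because the all-NaN groups have already been summed to 0 (a sketch):

df['totalamount'] = df['totalamount'].astype(int)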
df_new = df.groupby(['CustomerID', 'Status'], sort=False)['Amount'].sum().reset_index()
df_new = (df_new[df_new['Status'] == 'Success']
          .drop(columns='Status')
          .rename(columns={'Amount': 'totalamount'}))
df = pd.merge(df, df_new, on=['CustomerID'], how='left')
I'm not sure at all but I think this may work
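One caveat worth noting about this merge approach: customers with no successful rows (CustomerID 4 here) get NaN from the left merge rather than 0, so a final fill is needed to match the expected output (a sketch):

df['totalamount'] = df['totalamount'].fillna(0)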
I am trying to perform a window operation on the following pandas data frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'visitor_id': ['a','a','a','a','a','a','b','b','b','b','c','c','c','c','c'],
                   'time_on_site': [3,5,6,4,5,3,7,6,7,8,1,2,2,1,2],
                   'site_visit': [1,2,3,4,5,6,1,2,3,4,1,2,3,4,5],
                   'feature_visit': [np.nan,np.nan,1,np.nan,2,3,1,2,3,4,np.nan,1,2,3,np.nan]})
"For each distinct user, calculate the average time they spent on the website and their total number of visits before they interacted with a feature."
The data consists of four columns with the following definitions:
visitor_id is a string that identifies a unique given visitor
time_on_site is the time they spent on the website
site_visit is an incrementing counter of the times they visited the website.
feature_visit is an incrementing counter of the times they used a specific feature on the site. Any visit before or without a feature interaction produces a NaN; each visit where they did interact with the feature increments the counter by one.
visitor_id time_on_site site_visit feature_visit
a 3 1 NaN
a 5 2 NaN
a 6 3 1
a 4 4 NaN
a 5 5 2
a 3 6 3
b 7 1 1
b 6 2 2
b 7 3 3
b 8 4 4
c 1 1 NaN
c 2 2 1
c 2 3 2
c 1 4 3
c 2 5 NaN
The expected output should look like this:
id mean count
a 4 2
b NaN 0
c 1 1
Which was created based on the following logic:
For user a, the expected output is 4, which is the average time_on_site for site_visit 1 and 2, which occurred before the first feature interaction on site_visit 3.
For user b the average time should be NaN because they had no prior visits before their first interaction with the feature.
For user c, their average time is just 1, since they only had one visit before interacting with the new feature.
If a user never used the new feature, their mean and count should be NaN.
Thanks in advance for the help.
Try this:
def summarize(x):
    used = x[x['feature_visit'].notnull()]
    if used.empty:  # visitor never used the feature
        return pd.Series({'mean': np.nan, 'count': np.nan})
    first = used.index[0]
    before = x[x.index < first]  # visits before the first interaction
    return pd.Series({
        'mean': before['time_on_site'].mean(),
        'count': before['site_visit'].count()
    })

df.groupby('visitor_id').apply(summarize)
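For larger frames, a vectorized alternative avoids apply entirely: a cumulative count of non-null feature_visit values is 0 exactly on the rows before each visitor's first interaction. A sketch (note that user b comes back as NaN for both columns after the reindex, so fill count with 0 if you prefer that convention):

# rows before the first feature interaction have a cumulative feature count of 0
before_first = df['feature_visit'].notna().groupby(df['visitor_id']).cumsum().eq(0)
result = (df[before_first]
          .groupby('visitor_id')['time_on_site']
          .agg(mean='mean', count='count')
          .reindex(df['visitor_id'].unique()))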
I have a Pandas dataframe of prices for different months and years (timeseries), 80 columns. I want to be able to detect significant changes in prices either up or down and color them differently in a dataframe. Is that possible and what would be the best approach?
Jan-2001 Feb-2001 Jan-2002 Feb-2002 ....
100 30 10 ...
110 25 1 ...
40 5 50
70 11 4
120 35 2
Here, in the first column 40 and 70 should be marked, in the second column 5 and 11 should be marked, and in the third column I'm not really sure, but probably 1, 50, 4, 2...
Your question involves 2 problems as I see it.
Printing the highlighting depends on the output target you're trying to reach, be it STDOUT, a file, or some specific program.
Identifying outliers based on the column data. It's hard to tell whether you want this based on the entire dataset or on the previous data in the column, like a rolling outlier, i.e., the preceding data is used to decide whether the next value is out of whack.
Below I provide a method that goes at the data with standard deviation / z-scoring based on the mean of each entire column. You will have to tweak the > and < thresholds to get to your desired state; there are many intricacies in dealing with this concept, and I would suggest taking a look at a few resources on the subject.
For your data:
Jan-2001,Feb-2001,Jan-2002
100,30,10
110,25,1
40,5,50
70,11,4
120,35,20000
I am aware of methods to highlight, but not in the terminal. The pandas styling method (https://pandas.pydata.org/pandas-docs/stable/style.html) works in a few programs.
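For notebook/HTML output, a minimal Styler sketch using the same z-score rule as the code below (the yellow color is an arbitrary choice):

def highlight_outliers(col):
    # compute the column z-scores and color the values outside the thresholds
    z = (col - col.mean()) / col.std(ddof=0)
    return ['background-color: yellow' if (v > 1.5 or v < -0.5) else '' for v in z]

styled = df.style.apply(highlight_outliers)  # view in a notebook, or styled.to_html()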
To get at the original item, identifying the outliers in your data, you could use something like the below, which flags values based on standard deviation and z-score.
Sample Code:
import pandas as pd

df = pd.read_csv("full.txt")
original = df.columns
print(df)
for col in original:
    col_std = col + "_std"
    col_zscore = col + "_zscore"
    df[col_std] = df[col].std(ddof=0)  # population std dev of the column
    df[col_zscore] = (df[col] - df[col].mean()) / df[col_std]  # z-score of each value
    # flag values whose z-score exceeds the (tunable) thresholds
    print(df[col].loc[(df[col_zscore] > 1.5) | (df[col_zscore] < -.5)])
print(df)
Output 1: # prints the original dataframe
Jan-2001 Feb-2001 Jan-2002
100 30 10
110 25 1
40 5 50
70 11 4
120 35 20000
Output 2: # Identifies the outliers
2 40
3 70
Name: Jan-2001, dtype: int64
2 5
3 11
Name: Feb-2001, dtype: int64
0 10
1 1
3 4
4 20000
Name: Jan-2002, dtype: int64
Output 3: # Prints the full dataframe created, with zscore of each item based on the column
Jan-2001 Feb-2001 Jan-2002 Jan-2001_std Jan-2001_zscore \
0 100 30 10 32.710854 0.410152
1 110 25 1 32.710854 0.751945
2 40 5 50 32.710854 -1.640606
3 70 11 4 32.710854 -0.615227
4 120 35 2 32.710854 1.093737
Feb-2001_std Feb-2001_zscore Jan-2002_std Jan-2002_zscore
0 12.735776 0.772524 20.755722 -0.183145
1 12.735776 0.333590 20.755722 -0.667942
2 12.735776 -1.422147 20.755722 1.971507
3 12.735776 -0.895426 20.755722 -0.506343
4 12.735776 1.211459 20.755722 -0.614076
Resources for zscore are here:
https://statistics.laerd.com/statistical-guides/standard-score-2.php
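If the rolling interpretation mentioned above is what you are after, the same z-score idea works over a moving window; a sketch, where the window size of 3 is an arbitrary assumption:

window = 3  # hypothetical window size
for col in original:
    rolling_mean = df[col].rolling(window).mean()
    rolling_std = df[col].rolling(window).std(ddof=0)
    # deviation of each value from its own recent past
    z = (df[col] - rolling_mean) / rolling_std
    print(df[col][z.abs() > 1.5])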
I am working on an algorithm, which requires grouping by two columns. Pandas supports grouping by two columns by using:
df.groupby([col1, col2])
But the resulting dataframe is not the required dataframe
Work Setup:
Python : v3.5
Pandas : v0.18.1
Pandas Dataframe - Input Data:
Type Segment
id
1 Domestic 1
2 Salary 3
3 NRI 1
4 Salary 4
5 Salary 3
6 NRI 4
7 Salary 4
8 Salary 3
9 Salary 4
10 NRI 4
Required Dataframe:
Count of [Domestic, Salary, NRI] in each Segment
Domestic Salary NRI
Segment
1 1 0 1
3 0 3 0
4 0 3 2
Experiments:
group = df.groupby(['Segment', 'Type'])
group.size()
Segment  Type
1        Domestic    1
         NRI         1
3        Salary      3
4        Salary      3
         NRI         2
dtype: int64
I am able to achieve the required dataframe using MS Excel Pivot Table feature. Is there any way, where I can achieve similar results using pandas?
After the groupby.size operation, a MultiIndex (2-level index) Series is created; it needs to be converted into a dataframe, which can be done by unstacking the 2nd index level and optionally filling the resulting NaNs with 0:
df.groupby(['Segment', 'Type']).size().unstack(level=1, fill_value=0)
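For what it's worth, pd.crosstab builds the same table in a single call; a sketch on the sample data above:

pd.crosstab(df['Segment'], df['Type'])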