Initializing new columns in pd.pivot_table using existing columns within pd.pivot_table - python-3.x

I have a table imported into Python using pd.read_csv that looks as follows.
I need to perform the following four activities on this table:
Calculate the number of free apps and paid apps in each genre using group by.
I wrote the following code to get the desired output:
df.groupby(['prime_genre', 'Subscription'])[['id']].count()
Output:
Convert the result from step 1 into a dataframe with the columns prime_genre, free, and paid, and the rows holding the count of each.
I wrote the following code to get the desired output:
df1 = df.groupby(['prime_genre','Subscription'])['id'].count().reset_index()
df1.pivot_table(index='prime_genre', columns='Subscription', values='id', aggfunc='sum')
Output:
Now I need to initialize a column 'Total' that captures the sum of 'free app' and 'paid app' within the pivot table itself
I also need to initialize two more columns, perc_free and perc_paid, that display the percentage of free apps and paid apps within the pivot table itself.
How do I go about activities 3 and 4?

Assuming the following pivot table named df2:
subscription free app paid app
prime_genre
book 66 46
business 20 37
catalogs 9 1
education 132 321
You can calculate the total using pandas.DataFrame.sum on the columns (axis=1). Then divide df2 by this total and multiply by 100 to get the percentages. You can add a suffix to the columns with pandas.DataFrame.add_suffix. Finally, combine everything with pandas.concat:
total = df2.sum(axis=1)
percent = df2.div(total, axis=0).mul(100).add_suffix(' percent')
df2['Total'] = total
pd.concat([df2, percent], axis=1)
output:
subscription free app paid app Total free app percent paid app percent
prime_genre
book 66 46 112 58.928571 41.071429
business 20 37 57 35.087719 64.912281
catalogs 9 1 10 90.000000 10.000000
education 132 321 453 29.139073 70.860927
Here is a variant to get the perc_free / perc_paid names:
import re

total = df2.sum(axis=1)
percent = (df2.div(total, axis=0)
             .mul(100)
             .rename(columns=lambda x: re.sub('(.*)( app)', r'perc_\1', x))
          )
df2['Total'] = total
pd.concat([df2, percent], axis=1)
subscription free app paid app Total perc_free perc_paid
prime_genre
book 66 46 112 58.928571 41.071429
business 20 37 57 35.087719 64.912281
catalogs 9 1 10 90.000000 10.000000
education 132 321 453 29.139073 70.860927
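As an additional sketch, not part of the answer above: pivot_table can also produce the row totals directly via margins=True. Assuming the Subscription values are 'free app' and 'paid app' as in the example, something along these lines should give the same 'Total' and percentage columns:
df2 = df1.pivot_table(index='prime_genre', columns='Subscription',
                      values='id', aggfunc='sum',
                      margins=True, margins_name='Total')
df2 = df2.drop(index='Total')  # margins also adds a totals row; keep only the per-genre rows
df2['perc_free'] = df2['free app'] / df2['Total'] * 100
df2['perc_paid'] = df2['paid app'] / df2['Total'] * 100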

Related

Perform unique row operation after a groupby

I have been stuck on a problem: I have done all the groupby operations and got the resultant dataframe shown below, but the problem comes in the last step, calculating one additional column.
Current dataframe:
code industry category count duration
2 Retail Mobile 4 7
3 Retail Tab 2 33
3 Health Mobile 5 103
2 Food TV 1 88
The question: I want an additional column, operation, which calculates the ratio of the count for industry 'Retail' to the total count for that code.
For example: code 2 has two industry entries, Retail and Food, so the operation column should have the value 4/(4+1) = 0.8, and similarly for code 3, as shown below.
O/P:
code industry category count duration operation
2 Retail Mobile 4 7 0.8
3 Retail Tab 2 33 -
3 Health Mobile 5 103 2/7 = 0.285
2 Food TV 1 88 -
Help on this as well: if I just do a groupby I will lose the category and duration information. Also, what would be a better way to represent the output dataframe, given that there can be multiple industries and the operation is limited to just Retail?
I can't think of a single-operation solution, but the approach via a dictionary should work. And, for the benefit of the other answerers, here is the code to create the example dataframe.
st_l = [[2, 'Retail', 'Mobile', 4, 7],
        [3, 'Retail', 'Tab', 2, 33],
        [3, 'Health', 'Mobile', 5, 103],
        [2, 'Food', 'TV', 1, 88]]
df = pd.DataFrame(st_l, columns=['code', 'industry', 'category', 'count', 'duration'])
And now my attempt:
sums = df[['code', 'count']].groupby('code').sum().to_dict()['count']
df['operation'] = df.apply(lambda x: x['count']/sums[x['code']], axis=1)
You can create a new column with the total count of each code using groupby.transform(), and then use loc to find only the rows that have 'Retail' as their industry and perform your division:
df['total_per_code'] = df.groupby(['code'])['count'].transform('sum')
df.loc[df.industry.eq('Retail'), 'operation'] = df['count'].div(df.total_per_code)
df.drop('total_per_code',axis=1,inplace=True)
prints back:
code industry category count duration operation
0 2 Retail Mobile 4 7 0.800000
1 3 Retail Tab 2 33 0.285714
2 3 Health Mobile 5 103 NaN
3 2 Food TV 1 88 NaN

Cumulatively Reduce Values in Column by Results of Another Column

I am dealing with a dataset that shows duplicate stock per part and location. Orders from multiple customers are coming in and the stock was just added via a vlookup. I need help writing some sort of looping function in python that cumulatively decreases the stock quantity by the order quantity.
Currently data looks like this:
SKU Plant Order Stock
0 5455 989 2 90
1 5455 989 15 90
2 5455 990 10 80
3 5455 990 20 80
I want to accomplish this:
SKU Plant Order Stock
0 5455 989 2 88
1 5455 989 15 73
2 5455 990 10 70
3 5455 990 20 50
Try:
df.Stock -= df.groupby(['SKU','Plant'])['Order'].cumsum()
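For reference, a minimal self-contained sketch reproducing the example above (the DataFrame construction is illustrative, not taken from the question):
import pandas as pd

df = pd.DataFrame({'SKU':   [5455, 5455, 5455, 5455],
                   'Plant': [989, 989, 990, 990],
                   'Order': [2, 15, 10, 20],
                   'Stock': [90, 90, 80, 80]})

# running total of orders within each (SKU, Plant) group, subtracted from the stock
df.Stock -= df.groupby(['SKU', 'Plant'])['Order'].cumsum()
print(df)
#     SKU  Plant  Order  Stock
# 0  5455    989      2     88
# 1  5455    989     15     73
# 2  5455    990     10     70
# 3  5455    990     20     50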

Mark sudden changes in prices in a dataframe time series and color them

I have a Pandas dataframe of prices for different months and years (timeseries), 80 columns. I want to be able to detect significant changes in prices either up or down and color them differently in a dataframe. Is that possible and what would be the best approach?
Jan-2001 Feb-2001 Jan-2002 Feb-2002 ....
100 30 10 ...
110 25 1 ...
40 5 50
70 11 4
120 35 2
Here in the first column 40 and 70 should be marked, in the second column 5 and 11 should be marked, in the third column not really sure but probably 1, 50, 4, 2...
Your question involves two problems, as I see it.
Printing the highlighting depends on the output method you're trying to get to, be it STDOUT, a file, or something program-specific.
Identification of outliers based on the column data. It's hard to tell whether you want it based on the entire dataset, versus the previous data in the column like a rolling outlier, i.e. the preceding data is used to decide whether the next value is out of whack.
In the instance below I provide a method to go at the data with standard deviation / z-scoring based on the mean of the data in the entire column. You will have to tweak the > and < comparisons to get to your desired state; there are many intricacies to this concept and I would suggest taking a look at a few resources on the subject.
For your data:
Jan-2001,Feb-2001,Jan-2002
100,30,10
110,25,1
40,5,50
70,11,4
120,35,20000
I am aware of methods to highlight, but not in the terminal. The styling approach described at https://pandas.pydata.org/pandas-docs/stable/style.html works in a few programs.
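For example, a minimal sketch of such highlighting with the pandas Styler (this assumes a Jupyter/HTML environment and a z-score threshold of 1.5; it is not part of the identification code below):
def highlight_outliers(col, threshold=1.5):
    # z-score the column against its own mean/std and flag large deviations
    z = (col - col.mean()) / col.std(ddof=0)
    return ['background-color: yellow' if abs(v) > threshold else '' for v in z]

styled = df.style.apply(highlight_outliers, axis=0)  # renders colored cells in HTML output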
To get at the original item, identifying outliers in your data, you could use something like the code below, based on standard deviation and z-score.
Sample Code:
df = pd.read_csv("full.txt")
original = df.columns
print(df)
for col in df.columns:
    # z-score each column against its own mean and (population) standard deviation
    col_zscore = col + "_zscore"
    df[col_zscore] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    # print the values whose z-score falls outside the chosen thresholds
    print(df[col].loc[(df[col_zscore] > 1.5) | (df[col_zscore] < -.5)])
print(df)
Output 1: # prints the original dataframe
Jan-2001 Feb-2001 Jan-2002
100 30 10
110 25 1
40 5 50
70 11 4
120 35 20000
Output 2: # Identifies the outliers
2 40
3 70
Name: Jan-2001, dtype: int64
2 5
3 11
Name: Feb-2001, dtype: int64
0 10
1 1
3 4
4 20000
Name: Jan-2002, dtype: int64
Output 3: # Prints the full dataframe created, with zscore of each item based on the column
Jan-2001 Feb-2001 Jan-2002 Jan-2001_std Jan-2001_zscore \
0 100 30 10 32.710854 0.410152
1 110 25 1 32.710854 0.751945
2 40 5 50 32.710854 -1.640606
3 70 11 4 32.710854 -0.615227
4 120 35 2 32.710854 1.093737
Feb-2001_std Feb-2001_zscore Jan-2002_std Jan-2002_zscore
0 12.735776 0.772524 20.755722 -0.183145
1 12.735776 0.333590 20.755722 -0.667942
2 12.735776 -1.422147 20.755722 1.971507
3 12.735776 -0.895426 20.755722 -0.506343
4 12.735776 1.211459 20.755722 -0.614076
Resources for zscore are here:
https://statistics.laerd.com/statistical-guides/standard-score-2.php

Excel - Show summarised counts of average, min and max value from range of data

I have two sets of data: Set 1 is a list of accounts with a count of items for each account from Set 2. Set 2 is a breakdown of the values for each account.
The examples below are representative of the data I'm working with, and there are thousands of records, so I'm after a formulaic way of doing this.
Set 1
Account Count
123 3
128 2
135 4
157 5
...
Set 2
Account Value
123 11
123 31
123 98
128 77
128 99
...
What I'd like to do is show a set of data containing the average of the values for each account, plus the min and max values.
For example:
Output
Account Count Average Min Max
123 3 47 11 98
128 2 88 77 99
...
Ignore the first set and merely use Set 2 as a basis for a pivot table.
Then drag the column Value four times into the Value box and summarize by Count, Average, Min, and Max.
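For readers coming from the pandas questions above, a rough equivalent of that pivot-table summary (assuming Set 2 is loaded as a DataFrame named set2 with columns Account and Value; the names are illustrative, this is not the Excel approach itself):
summary = (set2.groupby('Account')['Value']
               .agg(Count='count', Average='mean', Min='min', Max='max')
               .reset_index())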

EXCEL: Count of Column Items For Every Distinct Item In Another

In my Excel sheet, I have 2 columns. Names of restaurants, and ratings for each one. For each rating, a new row is created, so of course a restaurant name occurs multiple times.
Restaurant Rating
McDonalds 8
McDonalds 7
Red Robin 5
Qdoba 7
Etc.
How can I get the number of times each rating happens for each restaurant? We'll say rating goes from 1-10.
I want it to look like this:
Restaurant 1(rating) 2 3 4 5 6 7 8 9 10
McDonalds 889 22 45 77 484 443 283 333 44 339
Any help is appreciated!
Using Pivot Tables:
Use a pivot table: set your rows to "Restaurant", your columns to "Rating", and your values to "Count of Rating".
Using Countifs:
Alternatively, with the restaurant names in one column and the ratings in another, a formula of the general form =COUNTIFS(restaurant_range, restaurant_name, rating_range, rating_value), filled across one column per rating value 1-10, produces the same counts (the range names here are placeholders).
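For comparison with the pandas questions above, the same rating-count table can be sketched with pd.crosstab (the DataFrame below is illustrative, built from the sample rows in the question; it is not part of the Excel answer):
import pandas as pd

df = pd.DataFrame({'Restaurant': ['McDonalds', 'McDonalds', 'Red Robin', 'Qdoba'],
                   'Rating': [8, 7, 5, 7]})
counts = pd.crosstab(df['Restaurant'], df['Rating'])  # one row per restaurant, one column per rating value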