What is the simplest way to use Python to group based on a combination of columns (4 columns) and sum the amount (1 column) column? - python-3.x

Using Python 3.6 and Pandas 0.23.0 to automate accounting.
I want to group 4 columns based on certain combined values (63 different combinations) and then sum the 5th column. Then take the output of those 63 different values to a 2 column output: Combination, Amount.
The 63 combinations will always be the same.
For example:
There are columns A, B, C, D, E.
Column A can have 3 values:
Ebay
Amazon
Shopify
Column B can have 5 values:
Sale
Refund
etc.
Column C can have 8 values:
StorePrice
StoreFee
Tax
TaxRefund
etc.
Column D can have 30 values:
SoldAmount
TaxAmount
PromotionAmount
RefundAmount
OtherAmount
etc.
Column E can have a numerical value:
-1,000,000 - 1,000,000
NOTE: The number of unique combined values is 63 for our purpose. Refunds can’t be Promotions, etc.
I need to find the sum of Column E for each combination.
For perspective, this is typically done with a Pivot Table in Excel, except I have to do it manually, so that is 63 different sorts. For example, I will group by Ebay, Sale, StorePrice, SoldAmount to get the summed amount of all Sold Ebay sales over a period.
I thought about storing a list of the 63 combinations in my code and then looping through the .txt file. Sum For w, x, y, z: sort of thing. Here is where I started and then got stuck:
import pandas as pd
data = pd.read_csv('/Users/XXX/Desktop/statement.txt', sep='\t', header=0)
df = pd.DataFrame(data)
test3 = df.groupby(['Column A','Column B', 'Column A', 'Column D']).sum()
This gets me close, but I'm stuck.
What is the simplest way to solve this problem? Any help is appreciated!

Your key list repeats 'Column A' and leaves out 'Column C'. It should instead be this:
df.groupby(['Column A','Column B', 'Column C', 'Column D']).sum()
If you told us the actual and expected result,
we'd be in a better position to help you.
https://stackoverflow.com/help/mcve
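To get from the grouped sums to the two-column Combination, Amount output the question asks for, something along these lines may work. This is a minimal sketch: the sample rows are made up for illustration, and the names totals and summary are hypothetical, not from the original post.

```python
import pandas as pd

# Made-up sample rows standing in for the tab-separated statement file.
df = pd.DataFrame({
    "Column A": ["Ebay", "Ebay", "Amazon", "Ebay"],
    "Column B": ["Sale", "Sale", "Refund", "Sale"],
    "Column C": ["StorePrice", "StorePrice", "TaxRefund", "Tax"],
    "Column D": ["SoldAmount", "SoldAmount", "RefundAmount", "TaxAmount"],
    "Column E": [100.0, 50.0, -20.0, 8.0],
})

keys = ["Column A", "Column B", "Column C", "Column D"]

# Sum only the amount column for each combination that occurs in the data.
totals = df.groupby(keys, as_index=False)["Column E"].sum()

# Collapse the four key columns into a single "Combination" label.
summary = pd.DataFrame({
    "Combination": totals[keys].apply(lambda row: ", ".join(row), axis=1),
    "Amount": totals["Column E"],
})
print(summary)
```

Only combinations that actually appear in the file show up in the result, so there is no need to store the list of 63 combinations in the code.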

Related

Excel - getting a value based on the max value off another row in a Table

I'm looking for a solution for a problem I'm facing in Excel. This is my table simplified:
Every sale has a unique ID, but several people can have contributed to a sale. The columns "Name" and "Share of sales(%)" show who contributed and what their percentage was.
Sale_ID  Name      Share of sales(%)
1        Person A  100
2        Person B  100
3        Person A  30
3        Person C  70
Now I want to add a column to my table that shows the name of the person with the highest share of sales percentage per Sale_ID. Like this:
Sale_ID  Name      Share of sales(%)  Highest sales
1        Person A  100                Person A
2        Person B  100                Person B
3        Person A  30                 Person C
3        Person C  70                 Person C
So when multiple people have contributed the new column shows only the one with the highest value.
I hope someone can help me, thanks in advance!
You can try this on cell D2:
=LET(maxSales, MAXIFS(C2:C5,A2:A5,A2:A5),
INDEX(B2:B5, XMATCH(A2:A5&maxSales,A2:A5&C2:C5)))
or just removing the LET since maxSales is used only one time:
=INDEX(B2:B5, XMATCH(A2:A5&MAXIFS(C2:C5,A2:A5,A2:A5),A2:A5&C2:C5))
On cell E2 I provided another solution via MAP/XLOOKUP:
=LET(maxSales, MAXIFS(C2:C5,A2:A5,A2:A5),
MAP(A2:A5, maxSales, LAMBDA(a,b, XLOOKUP(a&b, A2:A5&C2:C5, B2:B5))))
similarly without LET:
=MAP(A2:A5, MAXIFS(C2:C5,A2:A5,A2:A5),
LAMBDA(a,b, XLOOKUP(a&b, A2:A5&C2:C5, B2:B5)))
Explanation
The trick here is to identify the maximum share of sales for each group, which can be done via MAXIFS(max_range, criteria_range1, criteria1, [criteria_range2, criteria2], ...). The max_range and criteria_rangeN arguments must have the same size and shape.
MAXIFS(C2:C5,A2:A5,A2:A5)
it produces the following output:
maxSales
100
100
70
70
MAXIFS returns an output of the same size as criteria1, so for each row it returns the maximum share of sales for that row's Sale_ID value.
It is the array equivalent of filling the following formula down:
MAXIFS($C$2:$C$5,$A$2:$A$5,A2)
INDEX/XMATCH Solution
Having the array with the maximum shares of sales, we just need to identify the row position via XMATCH and return the corresponding B2:B5 cell via INDEX. We use concatenation (&) in the XMATCH input arguments so that it matches on more than one criterion.
MAP/XLOOKUP Solution
MAP takes each pair of values (a,b), row by row, from its first two input arguments, finds where the maximum value for that group occurs, and returns the corresponding Name column value. To look up on an additional criterion, we use concatenation (&) in the first two XLOOKUP input arguments.
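For comparison, the same "find the max per group, then look up the name" idea can be sketched in pandas. This is not part of the Excel answer above, only an assumed equivalent; the column name Share and the variable names are hypothetical.

```python
import pandas as pd

# The sample table from the question.
df = pd.DataFrame({
    "Sale_ID": [1, 2, 3, 3],
    "Name": ["Person A", "Person B", "Person A", "Person C"],
    "Share": [100, 100, 30, 70],
})

# Row label of the largest share within each Sale_ID (the MAXIFS step).
top_idx = df.groupby("Sale_ID")["Share"].idxmax()

# Map each Sale_ID to the name found on that row (the lookup step).
top_name = df.loc[top_idx].set_index("Sale_ID")["Name"]
df["Highest sales"] = df["Sale_ID"].map(top_name)
print(df)
```

If two people tie for the highest share, idxmax keeps the first one it encounters.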

Count Unique Dates Associated with Location

I am trying to count the total of Unique Dates based on the location.
Context: I am trying to create a formula for counting the number of unique dates based on location. My spreadsheet looks like this:
A B C
1 **Participant Location Date**
2 Participant-A High School X 11/7
3 Participant-B High School X 11/7
4 Participant-C High School X 11/8
5 Participant-E High School Y 11/7
6 Participant-F High School Z 11/7
7 Participant-G High School Z 11/8
So for example: high School X had 2 different dates. What would the formula be to count the unique dates based on the location?
This is being done in Google Sheets.
Thank you!
Another way (with no helper columns) would be to use query() and unique().
=query(unique(B:C), "Select Col1, count(Col2) where Col1 <>'' group by Col1 label count(Col2)'# of unique dates'", 1)
With a simple helper column :
=1/COUNTIFS($A$2:$A$7,A7,$B$2:$B$7,B7)
And to get your results :
=SUMIF($A$2:$A$7,E2,$C$2:$C$7)
This is not a one-formula solution, but I think it works. First, create an extra column concatenating the columns that you want to compare. In this case, in cell D2 write:
=CONCATENATE(B2,C2)
This is for the first row of your example. Then, replicate that to the following rows.
Finally, create a formula that counts unique values:
=SUM(IF(FREQUENCY(IF(LEN(D2:D7)>0,MATCH(D2:D7,D2:D7,0),""), IF(LEN(D2:D7)>0,MATCH(D2:D7,D2:D7,0),""))>0,1))
Assuming your new column of concatenated values is at D2:D7.
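If the sheet can be exported, the same count is short in pandas. The sketch below assumes the data has been loaded into a DataFrame with the three columns from the question; the variable names are hypothetical.

```python
import pandas as pd

# The sample sheet from the question.
df = pd.DataFrame({
    "Participant": ["Participant-A", "Participant-B", "Participant-C",
                    "Participant-E", "Participant-F", "Participant-G"],
    "Location": ["High School X", "High School X", "High School X",
                 "High School Y", "High School Z", "High School Z"],
    "Date": ["11/7", "11/7", "11/8", "11/7", "11/7", "11/8"],
})

# Number of distinct dates seen at each location.
unique_dates = df.groupby("Location")["Date"].nunique()
print(unique_dates)
```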

Spreadsheet - ordering a column twice, using different scores (from 2 other colums)

I am working on a spreadsheet - currently google docs but happy to see answers relating to other spreadsheet software.
I have a list of foods (column A - Food)
I have list1 of "scores" (column B - Score1)
I have list2 of "scores" (column C - Score2)
I would like to add two new columns, ideally ordering the food from column A according to the scores, both list1 and list2 - so one new column ordering the foods based on the score1 from column B, and the other new columns based on the score2 from column C.
An example usually helps, so here is what I have:
Food Score1 Score2
a 12 45
b 96 67
c 100 32
Now, this would be "Version 1", on the way to getting what I would like:
Food Score1 Score2 Order1 Order2
a 12 45 3 2
b 96 67 2 1
c 100 32 1 3
Or, even better, "Version 2" - use the food name in the new columns, in the right order according to scores:
Food Score1 Score2 FoodScore1 FoodScore2
a 12 45 c b
b 96 67 b a
c 100 32 a c
I suspect that getting "Version 1" is probably achievable (but don't know how to do it)
I suspect that getting "Version 2" is not possible without some sort of procedural programming?
Hope someone can help!
Cheers
Or, even better, "Version 2" - use the food name in the new columns, in the right order according to scores:
Let A2:A10 be your food range, B2:B10 the score1 range, and D2:D10 the destination range (FoodScore1 in your example).
This works in both Excel and Google Spreadsheets:
=INDEX($A$2:$A$10,MATCH(LARGE($B$2:$B$10,1+ROW(A2)-ROW($A$2)),$B$2:$B$10,0))
Enter this formula in D2 and drag it down.
If the formula gives you an error, try changing , to ; (this depends on your locale settings).
P.S. For score2 the formula is the same; just change the ranges from score1 to score2 (i.e. $B$2:$B$10 to $C$2:$C$10).
I would start by adding a rank column, with number 1 for the highest score, number 2 for the second highest, etc.
In Microsoft Excel, assuming that the first column is A and that all the scores are unique, you could simply have a formula like
=COUNTIF(B:B, ">=" & $B1)
in cell D1, filled down, and similarly in column E for the second score.
Then if you fill column F with the ranks 1, 2, 3, ... you can simply do a VLOOKUP. Or, in this case, an equivalent solution with INDEX and MATCH - because you want to lookup the rank in column F and return the corresponding value from column A.
As you are working in a Google Spreadsheet, one would expect to see some Googliness in the solutions provided. Use this very simple, only usable in Google Spreadsheet, formula.
Formula
// FoodScore1
=QUERY(B2:D4, "SELECT B ORDER BY C DESC")
// FoodScore2
=QUERY(B2:D4, "SELECT B ORDER BY D DESC")
Explained
The data range of the QUERY function is simply B2:D4. Then a SQL-like SELECT statement selects only column B, ordered by column C or D in descending order.
Reference
https://developers.google.com/chart/interactive/docs/querylanguage
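Both "Version 1" and "Version 2" from the question can also be sketched in pandas, for anyone working outside a spreadsheet. This is only an illustration of the ranking and sorting ideas above, not one of the original answers, and it assumes the scores are unique.

```python
import pandas as pd

# The sample table from the question.
df = pd.DataFrame({
    "Food": ["a", "b", "c"],
    "Score1": [12, 96, 100],
    "Score2": [45, 67, 32],
})

# Version 1: rank each score column, 1 = highest score.
df["Order1"] = df["Score1"].rank(ascending=False, method="min").astype(int)
df["Order2"] = df["Score2"].rank(ascending=False, method="min").astype(int)

# Version 2: food names listed from highest to lowest score.
# to_numpy() drops the index so the sorted names are written top to bottom.
df["FoodScore1"] = df.sort_values("Score1", ascending=False)["Food"].to_numpy()
df["FoodScore2"] = df.sort_values("Score2", ascending=False)["Food"].to_numpy()
print(df)
```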

Find the top n values in a range while keeping the sum of values in another range under x value

I'd like to accomplish the following task. There are three columns of data. Column A represents price, where the sum needs to be kept under $100,000. Column B represents a value. Column C represents a name tied to columns A & B.
Out of >100 rows of data, I need to find the highest 8 values in column B while keeping the sum of the prices in column A under $100,000. And then return the 8 names from column C.
Can this be accomplished?
EDIT:
I attempted the Solver solution w/ no luck. 200 rows looks to be the max w/ Solver, and that is what I'm using now. Here are the steps I've taken:
Create a column called rank RANK(B2,$B$2:$B$200) (used column D -- what is the purpose of this?)
Create a column called flag just put in zeroes (used column E)
Create 3 total cells total_price (=SUM(A2:A200)), total_value (=SUM(B2:B200)) and total_flag (=(E2:E200))
Use solver to minimize total_value (shouldn't this be maximize??)
Add constraints -Total_price<=100000 -Total_flag=8 -Flag cells are binary
Using Simplex LP, it simply changes the flags for the first 8 values. However, the total price for the first 8 values is >$100,000 ($140k). I've tried changing some options in the Solver Parameters as well as using different solving methods to no avail. I'd like to post an image of the parameter settings, but don't have enough "reputation".
EDIT #2:
The first 5 rows looks like this, price goes down to ~$6k at the bottom of the table.
Price Value Name Rank Flag
$22,538 42.81905675 Blow, Joe 1 0
$22,427 37.36240932 Doe, Jane 2 0
$17,158 34.12127693 Hall, Cliff 3 0
$16,625 33.97654031 Povich, John 4 0
$15,631 33.58212402 Cow, Holy 5 0
I'll give you the solver solution as a starting point. It involves the creation of some extra columns and total cells. Note solver is limited in the amount of cells it can handle but will work with 100 anyway.
Create a column called rank RANK(B2,$B$2:$B$100)
Create a column called flag just put in zeroes
Create 3 total cells total_price, total_value and total_flag
Use solver to minimize total_value
Add constraints
-Total_price<=100000
-Total_flag=8
-Flag cells are binary
This will flag the rows you want and you can grab the names however you want.
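Outside of Solver, the same selection problem (pick exactly 8 rows, maximize total value, keep total price within the budget) can be sketched as a small dynamic program. This is an assumed alternative, not the Solver method described above. The function name pick_top is hypothetical, prices are taken as whole dollars, the budget is treated as "at most" rather than strictly "under", and the demo uses only the five sample rows with a smaller k and budget.

```python
def pick_top(rows, k, budget):
    """rows: list of (price, value, name). Pick exactly k rows with the
    highest total value whose total price is at most budget.
    Returns (total_value, names) or None if no such selection exists."""
    # best[(count, total_price)] = (total_value, indices of chosen rows)
    best = {(0, 0): (0.0, ())}
    for i, (price, value, _name) in enumerate(rows):
        # Iterate over a snapshot so each row is used at most once.
        for (count, total), (val, chosen) in list(best.items()):
            new_total = total + price
            if count == k or new_total > budget:
                continue
            key = (count + 1, new_total)
            candidate = (val + value, chosen + (i,))
            if key not in best or candidate[0] > best[key][0]:
                best[key] = candidate
    complete = [v for (count, _), v in best.items() if count == k]
    if not complete:
        return None
    total_value, chosen = max(complete, key=lambda t: t[0])
    return total_value, [rows[i][2] for i in chosen]


# The five sample rows from the question (price, value, name).
rows = [
    (22538, 42.81905675, "Blow, Joe"),
    (22427, 37.36240932, "Doe, Jane"),
    (17158, 34.12127693, "Hall, Cliff"),
    (16625, 33.97654031, "Povich, John"),
    (15631, 33.58212402, "Cow, Holy"),
]

# Demo with a reduced problem: best 2 names within a 40,000 budget.
result = pick_top(rows, k=2, budget=40000)
print(result)
```

For the real table the call would be pick_top(rows, k=8, budget=100000). The number of states can grow large with many distinct prices, so this is a starting point rather than a tuned solution.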

Pivot chart count number of serial samples not repeated samples

I have an Excel database that consists of ID, Name and Sample Date. I made a pivot table to count the number of names and samples each name has, but I want a count of the number of serial samples each name has, not including repeated samples.
Here is an example database:
ID Name Sample Date
M1.1 A 8/2/2013
M2.1a B 8/6/2013
M2.1b B 8/6/2013
M2.1c A 8/6/2013
M1.2 A 8/7/2013
M3.1 C 8/9/2013
M4.1 D 8/10/2013
M1.3 A 8/11/2013
M2.2 B 8/13/2013
I want the pivot table to be able to count that A has 4 serial samples, B has 2 serial samples instead of 3, C has 1, and D has 1.
Any suggestions on how to do this?
Not elegant but for want of any other answer so far:
Add a helper column containing =COUNTIFS(B:B,B2,C:C,C2) and copy down to suit. Order on that helper column and restrict the PivotTable range to include only half of those entries that show up as duplicates.
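If the data can be taken out of Excel, counting one sample per name and date is direct in pandas. This is a sketch under the assumption that "repeated samples" means rows sharing the same Name and Sample Date; the variable names are hypothetical.

```python
import pandas as pd

# The sample database from the question.
df = pd.DataFrame({
    "ID": ["M1.1", "M2.1a", "M2.1b", "M2.1c", "M1.2",
           "M3.1", "M4.1", "M1.3", "M2.2"],
    "Name": ["A", "B", "B", "A", "A", "C", "D", "A", "B"],
    "Sample Date": ["8/2/2013", "8/6/2013", "8/6/2013", "8/6/2013", "8/7/2013",
                    "8/9/2013", "8/10/2013", "8/11/2013", "8/13/2013"],
})

# Count distinct sample dates per name, so same-day repeats count once.
serial_samples = df.groupby("Name")["Sample Date"].nunique()
print(serial_samples)
```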
