Aggregating variables with similar first digit - statistics

In the table below each item has a 2-digit code. The first digit signifies a category.
I want to aggregate the items that share a first digit, per person, using Stata. The expected result is:
In this table, item1 = item11 + item14 + item15 + item17 and item2 = item21 + item25, each calculated per person.

clear
input str1 person item11 item21 item14 item15 item25 item17
a 2 3 5 1 3 50
end
egen item1 = rowtotal(item1*)
egen item2 = rowtotal(item2*)
drop item1? item2?
list
     +------------------------+
     | person   item1   item2 |
     |------------------------|
  1. |      a      58       6 |
     +------------------------+
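For readers who want to check the logic outside Stata, the same per-person aggregation can be sketched in Python. This is only an illustration under stated assumptions: the helper name `aggregate_by_first_digit` and the dictionary layout are hypothetical, and the data simply repeats the single example row above.

```python
from collections import defaultdict

# One dictionary per person; keys mirror the Stata variable names above.
rows = [
    {"person": "a", "item11": 2, "item21": 3, "item14": 5,
     "item15": 1, "item25": 3, "item17": 50},
]

def aggregate_by_first_digit(row, prefix="item"):
    """Sum all item columns that share the first digit of their 2-digit code."""
    totals = defaultdict(int)
    for name, value in row.items():
        if name.startswith(prefix):
            category = name[len(prefix)]  # first digit of the code
            totals[prefix + category] += value
    return {"person": row["person"], **dict(sorted(totals.items()))}

result = [aggregate_by_first_digit(r) for r in rows]
print(result)
```

Like `rowtotal(item1*)`, the grouping relies only on the variable names, so new items such as a hypothetical item31 would form a new category without changing the code.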

Related

Pandas, combine unique value from two column into one column while preserving order

I have data in four columns, as shown below. Some values in column 1 are duplicated in column 3. I would like to combine column 1 with column 3 while removing the duplicates from column 3, and I would also like to preserve the order of the items. Column 1 is associated with column 2, and column 3 is associated with column 4, so each item should keep its associated price when the columns are merged. Any help will be appreciated.
Input table:
Item  | Price | Item  | Price
----- | ----- | ----- | -----
Car   | 105   | Truck | 54822
Chair | 20    | Pen   | 1
Cup   | 2     | Car   | 105
      |       | Glass | 1
Output table:
Item  | Price
----- | -----
Car   | 105
Chair | 20
Cup   | 2
Truck | 54822
Pen   | 1
Glass | 1
Thank you in advance.
After separating the input table into its left and right parts, we can concatenate the left-hand items with the non-duplicated right-hand items using boolean indexing:
import pandas as pd
from io import StringIO
# this initial section only recreates your sample input table
# (sep is passed by keyword as a raw-string regex; the name `table` avoids shadowing the built-in input)
table = pd.read_table(StringIO("""| Item | Price | Item | Price |
|-------|-------|------|-------|
| Car | 105 | Truck| 54822 |
| Chair | 20 | Pen | 1 |
| Cup | 2 | Car | 105 |
| | | Glass| 1 |
"""), sep=r' *\| *', engine='python', usecols=[1,2,3,4], skiprows=[1], keep_default_na=False)
table.columns = list(table.columns[:2])*2
# now separate the input table into the left and right parts
left = table.iloc[:,:2].replace("", pd.NA).dropna().set_index('Item')
right = table.iloc[:,2:].set_index('Item')
# finally construct the output table by concatenating without duplicates
output = pd.concat([left, right[~right.index.isin(left.index)]])
print(output)
       Price
Item
Car      105
Chair     20
Cup        2
Truck  54822
Pen        1
Glass      1

Excel Summation of Multiple Conditional Maximum Values

I am working on getting pricing based on the number of units we are ordering, using Excel's SUMIFS. The data I have looks something like this:
A B C D
Item1 Comp1 1 4.99
Item1 Comp1 10 3.99
Item1 Comp1 100 2.99
Item1 Comp2 1 13.99
Item1 Comp2 100 10.99
Item1 Comp3 1 2.99
Item1 Comp3 10 2.59
Item1 Comp3 50 2.19
Item1 Comp3 100 1.99
... ... ... ...
Where column A is the main item, column B is the individual components of the item in column A, and column C is the number we need to order in order to get the price listed in column D.
In a separate sheet, I have the following table:
A B C
Item1 10 FORMULA
Item2 5 FORMULA
Item3 20 FORMULA
... ... ...
The point of this sheet is to have the Item name as seen in Column A of the first table, column B holds the number we need to order, and column C (hopefully) lists the total price by adding all the components at their respective price breaks.
In this example, the sum for Item1 I am looking for is 3.99 + 13.99 + 2.59 = 20.57 because 10 items gets the 10 price break for component 1, the 1 price break for component 2, and the 10 price break for component 3.
So far I am able to sum the cost based on the item name in column C:
=SUMIFS(Table1[D], Table1[A], "="&A2)
I am having trouble starting the second part, which is basically summing only the price at the largest price break for each component where Table1[C] <= B2.
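The selection rule described above (for each component, take the price at the largest break quantity that does not exceed the order quantity, then sum those prices) can be sketched in Python to make the target result concrete. This is not an Excel formula and not an answer from the thread; the function name `total_price` and the tuple layout are assumptions made for illustration, using the sample rows from the question.

```python
# Price-break rows: (item, component, minimum quantity, unit price).
price_breaks = [
    ("Item1", "Comp1", 1, 4.99),
    ("Item1", "Comp1", 10, 3.99),
    ("Item1", "Comp1", 100, 2.99),
    ("Item1", "Comp2", 1, 13.99),
    ("Item1", "Comp2", 100, 10.99),
    ("Item1", "Comp3", 1, 2.99),
    ("Item1", "Comp3", 10, 2.59),
    ("Item1", "Comp3", 50, 2.19),
    ("Item1", "Comp3", 100, 1.99),
]

def total_price(item, quantity, breaks):
    """Sum, over the item's components, the price at the largest break <= quantity."""
    best = {}  # component -> (break quantity, price)
    for name, component, min_qty, price in breaks:
        if name != item or min_qty > quantity:
            continue  # wrong item, or a break we have not reached yet
        if component not in best or min_qty > best[component][0]:
            best[component] = (min_qty, price)
    return round(sum(price for _, price in best.values()), 2)

total = total_price("Item1", 10, price_breaks)
print(total)  # 3.99 + 13.99 + 2.59
```

Any spreadsheet formula for column C should reproduce this figure for Item1 at a quantity of 10.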

I have data stored in excel where I need to sort that data

In Excel, I have data divided into the following columns:
Year Code Class Count
2001 RAI01 LNS 9
2001 RAI01 APRP 4
2001 RAI01 3
2002 RAI01 BPR 3
2002 RAI01 BRK 3
2003 RAI01 URE 3
2003 CFCOLLTXFT APRP 2
2003 CFCOLLTXFT BPR 2
2004 CFCOLLTXFT GRL 2
2004 CFCOLLTXFT HDS 2
2005 RAI HDS 2
I need to find the top 3 products for each customer in each year.
The real trick here is to rank each row based on a group.
Your rank is determined by your Count column (Column D).
Your group is determined by your Year and Code (I think) columns (Column A and B respectively).
You can use this gnarly sumproduct() formula to get a rank (Starting at 1) based on the Count for each Group.
So to get a ranking for each Year and Code from 1 to whatever, in a new column next to this data:
=SUMPRODUCT(($A$2:$A$50=A2)*(B2=$B$2:$B$50)*(D2<$D$2:$D$50))+1
And copy that down. Now you can AutoFilter on this to show all rows that have a rank less than 4. You can sort this on Customer, then Year and you should have a nice list of top 3 within each year/code.
Explanation of sumproduct.
Sumproduct goes row by row and applies the math that is defined for each row. When it is done it sums the results.
As an example, take the following worksheet:
+---+---+---+
| | A | B |
+---+---+---+
| 1 | 1 | 1 |
| 2 | 1 | 4 |
| 3 | 2 | 2 |
| 4 | 4 | 1 |
| 5 | 1 | 2 |
+---+---+---+
`=SUMPRODUCT((A1:A5)*(B1:B5))`
This sumproduct will take A1*B1, A2*B2, A3*B3, A4*B4, A5*B5 and then add those five results up to give you a number. That is 1 + 4 + 4 + 4 + 2 = 15
It will also work on conditional/boolean statements, returning a 1 or a 0 for each row/condition (for True and False, which is a "Boolean" value).
As an example, take the following worksheet that holds the type of publication in a library and a count:
+---+----------+---+
| | A | B |
+---+----------+---+
| 1 | Book | 1 |
| 2 | Magazine | 4 |
| 3 | Book | 2 |
| 4 | Comic | 1 |
| 5 | Pamphlet | 2 |
+---+----------+---+
=SUMPRODUCT((A1:A5="Book")*(B1:B5))
This will test to see if A1 is "Book" and return a 1 or 0, then multiply that result by whatever is in B1. It then continues for each row in the range up to row 5. The result will be 1+0+2+0+0 = 3. There are 3 books in the library (it's not a very big library).
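To make the row-by-row behaviour concrete, here is a minimal Python sketch of the same idea. The helper `sumproduct` is a hypothetical stand-in written for this illustration, not an Excel or library function; it is applied to the two example worksheets above.

```python
def sumproduct(*columns):
    """Multiply the columns row by row, then add up the row products."""
    total = 0
    for row in zip(*columns):
        product = 1
        for value in row:
            product *= value
        total += product
    return total

# First worksheet: plain numbers in columns A and B.
a = [1, 1, 2, 4, 1]
b = [1, 4, 2, 1, 2]
plain_total = sumproduct(a, b)

# Second worksheet: a condition becomes a column of 1s and 0s.
kinds = ["Book", "Magazine", "Book", "Comic", "Pamphlet"]
counts = [1, 4, 2, 1, 2]
is_book = [int(kind == "Book") for kind in kinds]
book_total = sumproduct(is_book, counts)

print(plain_total, book_total)
```

The comparison `kind == "Book"` plays the role of `(A1:A5="Book")`: it yields True or False, which is converted to 1 or 0 before multiplying.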
For this answer's sumproduct:
($A$2:$A$50=A2) compares each cell from A2 through A50 to A2 and returns a 1 where they are equal or a 0 where they are not.
(B2=$B$2:$B$50) will test each cell B2 through B50 to see if it is equal to B2 and return a 1 or 0 for each test.
The same is true for (D2<$D$2:$D$50), but it tests whether the current row's count is less than each cell's count, that is, whether that other row has a higher count.
So... essentially this is saying "For all the rows 2 through 50, find all the other rows that have the same values in columns A and B AND have a count greater than this row's count. Count all of the rows that meet those criteria, and add 1. This is the rank of this row within its group."
Copying this formula down recalculates the rank for each row, allowing you to rank and filter.
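The same ranking rule can be sketched in Python to check what the formula should return for the sample data. The function name `rank_in_group` is hypothetical, and the row with the missing Class value is kept with an empty string.

```python
# (year, code, class, count) rows taken from the question's sample data.
rows = [
    (2001, "RAI01", "LNS", 9),
    (2001, "RAI01", "APRP", 4),
    (2001, "RAI01", "", 3),
    (2002, "RAI01", "BPR", 3),
    (2002, "RAI01", "BRK", 3),
    (2003, "RAI01", "URE", 3),
    (2003, "CFCOLLTXFT", "APRP", 2),
    (2003, "CFCOLLTXFT", "BPR", 2),
    (2004, "CFCOLLTXFT", "GRL", 2),
    (2004, "CFCOLLTXFT", "HDS", 2),
    (2005, "RAI", "HDS", 2),
]

def rank_in_group(row, all_rows):
    """1 + number of rows with the same year and code but a higher count."""
    year, code, _, count = row
    higher = sum(1 for y, c, _, n in all_rows
                 if y == year and c == code and n > count)
    return higher + 1

ranks = [rank_in_group(r, rows) for r in rows]
top3 = [r for r, rank in zip(rows, ranks) if rank < 4]
print(ranks)
```

Rows with equal counts in the same group share a rank, so a filter on rank less than 4 can return more than three rows for a group when there are ties.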

Formula for count of distinct values, multiple conditions, one of which = or <> all repeating values

Excel formula (I know this may work with a pivot table, but wanting a formula) to count distinct values. If this is my table in Excel:
Region | Name | Criteria
------ | ------ | ------
1 | Jill | A
1 | Jill | A
1 | John | B
1 | John | A
2 | Jane | B
2 | Jane | B
2 | Bill | A
2 | Bill | B
3 | Mary | B
3 | Mary | B
3 | Gary | A
3 | Gary | A
In this example, I have the following formula to calculate the distinct values within each region =SUM(--(FREQUENCY(IF((Table1[Region]=A2)*(Table1[Name]<>""),MATCH(Table1[Name],Table1[Name],0)),ROW(Table1[Name])-ROW(Table!B2)+1)>0)) which results in 2 each (Region 1=Jill & John; 2=Jane & Bill, 3=Mary & Gary, each distinct name counted once).
I have an additional formula to calculate how many distinct names within each region have at least one "B", by adding *(Table1[Criteria]="B") after <>"") ... in this example, it would return Region 1=1, Region 2=2, Region 3=1, because neither Jill nor Gary has a "B"; all others have at least one "B".
Now I'm getting stuck on my last formula, where I want to count how many distinct values within each Region have a B in ALL of their occurrences. The outcome should be Region 1=0 (Jill has no B's and John has a B, but also has an A), Region 2=1 (Jane appears twice, counts as 1 distinct value, and both occurrences are B; Bill has a B in only one of his), and Region 3=1 (Mary has all B's).
It's complex for a formula-only task, but feasible.
The following array formula does the job. Although you did not specify it, I assume that if "Mary" has an A in another region, this should not cancel her count in region 3, so long as all records with the name "Mary" in region 3 have a "B". In other words, names can repeat in different regions but should not interfere across regions (which made the formula even longer). I added a test case for this: Mary in region 4 with an A did not interfere with Mary in region 3.
=SUM(IF((Table1[Region]=Table1[@Region])*(0=COUNTIFS(Table1[Region],Table1[@Region],
Table1[Name],Table1[Name],Table1[Criteria],"<>B")), 1/COUNTIFS(Table1[Name],Table1[Name],
Table1[Criteria],"B",Table1[Region],Table1[@Region]), 0))
Enter it, then press Ctrl+Shift+Enter, then copy/paste it down the column.
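The expected counts can be cross-checked with a short Python sketch of the same rule: a name counts for a region only if every one of its rows in that region has the criterion "B". The function name `count_all_b` is hypothetical, and the final record is an assumed stand-in for the cross-region test case mentioned above (Mary in region 4 with an A).

```python
from collections import defaultdict

# (region, name, criteria) rows from the question's table.
records = [
    (1, "Jill", "A"), (1, "Jill", "A"), (1, "John", "B"), (1, "John", "A"),
    (2, "Jane", "B"), (2, "Jane", "B"), (2, "Bill", "A"), (2, "Bill", "B"),
    (3, "Mary", "B"), (3, "Mary", "B"), (3, "Gary", "A"), (3, "Gary", "A"),
    (4, "Mary", "A"),  # assumed extra row: same name in another region
]

def count_all_b(rows, target="B"):
    """Per region, count distinct names whose rows in that region are all `target`."""
    seen = defaultdict(set)  # (region, name) -> set of criteria values
    for region, name, criterion in rows:
        seen[(region, name)].add(criterion)
    counts = defaultdict(int)
    for (region, _name), values in seen.items():
        counts[region] += int(values == {target})
    return dict(counts)

result = count_all_b(records)
print(result)
```

Keying on the (region, name) pair is what keeps Mary's A in region 4 from affecting her count in region 3.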

tabulate frequency counts including zeros

To illustrate the problem, consider the following data: 1,2,3,5,3,2. Enter this in a spreadsheet column and make a pivot table displaying the counts. Making use of the information in this pivot table, I want to create a new table, with counts for every value between 1 and 5.
1,1
2,2
3,2
4,0
5,1
What is a good way to do this? My first thought was to use VLOOKUP, trapping any lookup error. But GETPIVOTDATA is apparently preferred for pivot tables. In any case, I failed with both approaches.
To be a bit more specific, assume my pivot table of counts is "PivotTable1" and that I have already created a one column table holding all the needed lookup keys (i.e., the numbers from 1 to 5). What formula should I put in the second column of this new table?
So starting with this:
To illustrate the problem, consider the following data: 1,2,3,5,3,2. Enter this in a spreadsheet column and make a pivot table displaying the counts.
I then created the table like this:
X | Freq
- | ---------------------------------------------
1 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
2 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
3 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
4 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
5 | =IFERROR(GETPIVOTDATA("X",R3C1,"X",RC[-1]),0)
Or, in A1 mode:
X | Freq
- | -----------------------------------------
1 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F3),0)
2 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F4),0)
3 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F5),0)
4 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F6),0)
5 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",F7),0)
The column X in my summary table is in column F.
Or as a table formula:
X | Freq
- | -------------------------------------------
1 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[@X]),0)
2 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[@X]),0)
3 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[@X]),0)
4 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[@X]),0)
5 | =IFERROR(GETPIVOTDATA("X",$A$3,"X",[@X]),0)
That gave me this result:
X | Freq
- | ----
1 | 1
2 | 2
3 | 2
4 | 0
5 | 1
If performance is not a major concern, you can bypass the pivot table and use the COUNTIF() function.
Create a list of all the consecutive numbers that you want counts for, and use COUNTIF() for each of them. The first parameter is the range of your input numbers and the second is the corresponding number from the result list:
    A    B    C    D
1   1         1    =COUNTIF(A:A,C1)
2   2         2    =COUNTIF(A:A,C2)
3   3         3    =COUNTIF(A:A,C3)
4   5         4    =COUNTIF(A:A,C4)
5   3         5    =COUNTIF(A:A,C5)
6   2
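The COUNTIF() approach corresponds to counting each key in a fixed list of keys, so that missing values come out as zero. A minimal Python sketch, assuming the keys are the integers 1 through 5 as in the question:

```python
from collections import Counter

data = [1, 2, 3, 5, 3, 2]
counts = Counter(data)

# A Counter returns 0 for keys it has never seen, so no error trapping is needed.
table = [(value, counts[value]) for value in range(1, 6)]
for value, freq in table:
    print(f"{value},{freq}")
```

Iterating over the full key range, rather than over the observed values, is what produces the row for 4 with a count of 0.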
