Stata: searching for specific conditions in different variables - search

I have data that look like as follows.
Time Patient Doctor Fee charged in $
Jul-08 A 3 36
Jul-08 B 3 40
Jul-08 A 2 39
Jul-08 A 1 40
Jul-08 B 1 35
Jul-08 C 3 40
Jul-08 D 3 44
Jul-08 E 1 45
Jul-08 E 3 41
Jul-08 F 1 45
Jul-08 F 3 44
Jul-08 G 1 39
Jul-08 H 2 37
Jul-08 H 1 35
Jul-08 H 2 41
For example, Patient A visited Doctor 3 who charged Fee 36 dollars. I want code to give the minimum Fee for a given patient and what happens if a Patient switches to another Doctor.
This is a sample data set for the illustration of my question and I want to do it for almost 30,000 observations.

Look at the help for egen and consider the results of calculations like
egen min1 = min(Fee), by(Patient Doctor)
egen min2 = min(Fee), by(Patient)

Related

Analysis on dataframe with python

I want to be able to calculate the average 'goal','shot',and 'miss' per shooterName to use for further analysis and visualization
The code below gives me the count of the 3 attributes(shot,goal,miss) in the 'event' column sorted by 'shooterName'
Dataframe columns:
season period time teamCode event goal xCord yCord xCordAdjusted yCordAdjusted ... playerPositionThatDidEvent timeSinceFaceoff playerNumThatDidEvent shooterPlayerId shooterName shooterLeftRight shooterTimeOnIce shooterTimeOnIceSinceFaceoff shotDistance
Corresponding data
2020 1 16 PHI SHOT 0 -74 29 74 -29 ... C 16 11 8478439.0 Travis Konecny R 16 16 32.649655
2020 1 34 PIT SHOT 0 49 -25 49 -25 ... C 34 9 8478542.0 Evan Rodrigues R 34 34 47.169906
2020 1 65 PHI SHOT 0 -52 -31 52 31 ... L 65 86 8480797.0 Joel Farabee L 31 31 48.270074
2020 1 171 PIT SHOT 0 43 39 43 39 ... C 42 9 8478542.0 Evan Rodrigues R 42 42 60.307545
2020 1 209 PHI MISS 0 -46 33 46 -33 ... D 38 5 8479026.0 Philippe Myers R 38 38 54.203321
Current code:
dft['count'] = df.groupby(['shooterName', 'event'])['event'].agg(['count'])
dft
Current Output:
shooterName event count
A.J. Greer GOAL 1
MISS 6
SHOT 29
Aaron Downey GOAL 1
MISS 4
SHOT 35
Zenon Konopka GOAL 8
MISS 57
SHOT 176
Desired Output:
shooterName event count %totalshooterNameevents
A.J. Greer GOAL 1 .0277
MISS 6 .1666
SHOT 29 .805
Aaron Downey GOAL 1 .025
MISS 4 .1
SHOT 35 .875
Zenon Konopka GOAL 8 .0331
MISS 57 .236
SHOT 176 .7302
Something similar to this. My end goal is to be able to calculate each 'event' attribute as a percentage of the total 'event' by 'shooterName'. Below I added a column '%totalshooterNameevents' which is 'simply goal', 'shot', and 'miss' calculated by the sum of the 'goal, shot, and miss' per each 'shooterName'
Update
Try:
dft = df.groupby(['shooterName', 'event'])['event'].agg(['count']).reset_index()
dft['%total'] = dft.groupby('shooterName')['count'].apply(lambda x: x / sum(x))
print(dft)
# Output
shooterName event count %total
0 A.J. Greer GOAL 1 0.027778
1 A.J. Greer MISS 6 0.166667
2 A.J. Greer SHOT 29 0.805556
3 Aaron Downey GOAL 1 0.025000
4 Aaron Downey MISS 4 0.100000
5 Aaron Downey SHOT 35 0.875000
6 Zenon Konopka GOAL 8 0.033195
7 Zenon Konopka MISS 57 0.236515
8 Zenon Konopka SHOT 176 0.730290
Without sample, it's difficult to guess what you want. Try:
import pandas as pd
import numpy as np
# Setup a Minimal Reproducible Example
np.random.seed(2021)
df = pd.DataFrame({'shooterName': np.random.choice(list('AB'), 20),
'event': np.random.choice(['shot', 'goal', 'miss'], 20)})
# Create an empty dataframe?
dft = pd.DataFrame(index=df['shooterName'].unique())
# Do stuff
grp = df.groupby('shooterName')
dft['count'] = grp.count()
dft = dft.join(grp['event'].value_counts().unstack('event')
.div(dft['count'], axis=0))
Output:
>>> dft
count goal miss shot
A 12 0.416667 0.250 0.333333
B 8 0.500000 0.375 0.125000

Compare 2 sets of 2 columns in excel with a lookup

Hi all I need help with the following formula I have looked up ways to compare different datasets in excel but this particular is a little different to the examples ive seen. Say i have the following data set
A
B
C
D
E
F
AB
75
AB
75
Bob
AC
56
AC
68
Fre
AB
75
AB
75
Jill
I need a formula that compares (AB with CD) and prints out E where F is.
for example the result above would like this this since AB & CD are equal so print the name
A
B
C
D
E
F
AB
75
AB
75
Bob
Bob, Jill
AC
56
AC
68
Fre
Fre
AB
75
AB
75
Jill
Give a try on below formula.
=TEXTJOIN(", ",TRUE,FILTER($E$1:$E$3,MMULT(($A$1:$B$3=A1:B1)*($C$1:$D$3=C1:D1),TRANSPOSE({1,1}))))

Grouping data based on month-year in pandas and then dropping all entries except the latest one- Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write a code that groups data by particular month-year and then keep the entry of latest date in that particular month-year and drop the rest. The data is till year 2020
I was only able to fetch the count by month-year. I am not able to drop create a proper code that helps to group data as per month-year and indicator and get the correct results
Use Series.dt.to_period for months periods, aggregate index of maximal date per groups by DataFrameGroupBy.idxmax and then pass to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('m'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90

convert binary variable into unique group

I have a binary variable (Var C) that identifies when another variable (Var B) is above or below a different variable (Var A). There is typically a series of the same value for Var C. I'd like to make a new variable (Var D) that represents the unique group of data for the time in between switches.
Hope this helps
VarA VarB VarC VarD
30 28 1 1
32 28 1 1
33 30 1 1
32 32 1 1
34 33 1 1
35 36 0 2
37 38 0 2
38 39 0 2
39 39 0 2
40 39 1 3
38 37 1 3
37 36 1 3
35 33 1 3
Thanks in advance for any help
If your data is in columns A through C you can assign value 1 to cell D2 and use the following formula for the rest of the rows:
=IF(C3<>C2,D2+1,D2)
In D2 enter 1
In D3 enter:
=IF(C3=C2,D2,D2+1)
and copy downward.

Excel - Sorting based on metric values

I have a dataset which looks like:
Product Metrics C1 C2 C3
A1 Q1 20 30 10
Q2 213123 2312 32123
Q3 454 65 45
Q4 3 4 6
A2 Q1 10 5 1
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
A3 Q1 18 6 3
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
Now I want to sort the values based on metric Q1 - From smallest to largest (comparing against the product -A1,A2) then the final dataset should look like,
Product Metrics C1 C2 C3
A2 Q1 10 5 1
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
A3 Q1 18 6 3
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
A1 Q1 20 30 10
Q2 213123 2312 32123
Q3 454 65 45
Q4 3 4 6
hope this gives a clear picture. Thanks in advance guys
The way I would probably do it is transpose your columns and rows so that you have columns for Q1, Q2, Q3, Q4.
Like this:
Product Metrics Q1 Q2 Q3 Q4
A1 C1 20 213123 454 3
A1 C2 30 2312 65 4
A1 C3 10 32123 45 6
A2 C1 10 123 454 3
A2 C2 5 13 65 45
A2 C3 1 23 45 6
Then you can sort by Q1 using Data>Sort & Filter
CBRF23 already pointed in the right direction but I believe you have to go even a little bit further and flatten each product related sub-array into a single row like
A | B C D | E F G | H I J | K L M
---| Q1 --------| Q2 ------------ | Q3 ------- | Q4 -------
Pr | C1 C2 C3 | C1 C2 C3 | C1 C2 C3 | C1 C2 C3
A1 | 20 30 10 | 213123 2312 32123 | 454 65 45 | 3 4 6
A2 | 10 5 1 | 123 13 23 | 454 65 45 | 3 4 6
A3 | 18 6 3 | 123 13 23 | 454 65 45 | 3 4 6
(The first row just shows the Excel columns, second row the flattened Q1,Q2,Q3 and Q4 sections and third row the sub-headers for each column)
Now you can safely sort by column B. In case you want to sort by the sum of all Q1 metrics you could introduce another column N being the sum of B,C and D and use that for sorting.
Update:
To get your desired output format back there are basically to possibilities:
If the number of records is known and fixed you can set-up a "results" page in your excel folder with a list of small "sub-tables". The fields of each sub-array then directly reference the "transposed" fields in a line of the sorted master results array.
If the number of results is variable you will have to construct/reconstruct the results page mentioned above using a suitable vba script. The vba generated page can of course also consist of the sorted values directly rather than referencing the values in the sorted master array.

Resources