Stata: searching for specific conditions in different variables


I have data that look like this:
Time Patient Doctor Fee charged in $
Jul-08 A 3 36
Jul-08 B 3 40
Jul-08 A 2 39
Jul-08 A 1 40
Jul-08 B 1 35
Jul-08 C 3 40
Jul-08 D 3 44
Jul-08 E 1 45
Jul-08 E 3 41
Jul-08 F 1 45
Jul-08 F 3 44
Jul-08 G 1 39
Jul-08 H 2 37
Jul-08 H 1 35
Jul-08 H 2 41
For example, Patient A visited Doctor 3, who charged a Fee of 36 dollars. I want code that gives the minimum Fee for a given Patient and shows what happens to the Fee if a Patient switches to another Doctor.
This is a sample data set to illustrate my question; I want to do this for almost 30,000 observations.

Look at the help for egen and consider the results of calculations like
egen min1 = min(Fee), by(Patient Doctor)
egen min2 = min(Fee), by(Patient)
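For readers who work in pandas rather than Stata, the two egen calls above can be sketched roughly as follows. This is only an illustration: the column names Patient, Doctor and Fee are my shortening of the question's headers, and the data are the sample rows from the question.

```python
import pandas as pd

# Sample rows from the question (column names shortened for illustration).
df = pd.DataFrame({
    "Patient": list("ABAABCDEEFFGHHH"),
    "Doctor": [3, 3, 2, 1, 1, 3, 3, 1, 3, 1, 3, 1, 2, 1, 2],
    "Fee": [36, 40, 39, 40, 35, 40, 44, 45, 41, 45, 44, 39, 37, 35, 41],
})

# Like egen min1 = min(Fee), by(Patient Doctor):
# lowest fee for each patient-doctor pair, repeated on every row of the pair.
df["min1"] = df.groupby(["Patient", "Doctor"])["Fee"].transform("min")

# Like egen min2 = min(Fee), by(Patient):
# lowest fee each patient paid to any doctor.
df["min2"] = df.groupby("Patient")["Fee"].transform("min")

print(df)
```

Comparing min1 with min2 row by row shows which doctor gives each patient the lowest fee, which is one way to read the "switching" part of the question.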

Related

Analysis on dataframe with python

I want to calculate the average 'goal', 'shot', and 'miss' per shooterName to use for further analysis and visualization.
The code below gives me the count of the 3 attributes (shot, goal, miss) in the 'event' column, grouped by 'shooterName'.
Dataframe columns:
season period time teamCode event goal xCord yCord xCordAdjusted yCordAdjusted ... playerPositionThatDidEvent timeSinceFaceoff playerNumThatDidEvent shooterPlayerId shooterName shooterLeftRight shooterTimeOnIce shooterTimeOnIceSinceFaceoff shotDistance
Corresponding data
2020 1 16 PHI SHOT 0 -74 29 74 -29 ... C 16 11 8478439.0 Travis Konecny R 16 16 32.649655
2020 1 34 PIT SHOT 0 49 -25 49 -25 ... C 34 9 8478542.0 Evan Rodrigues R 34 34 47.169906
2020 1 65 PHI SHOT 0 -52 -31 52 31 ... L 65 86 8480797.0 Joel Farabee L 31 31 48.270074
2020 1 171 PIT SHOT 0 43 39 43 39 ... C 42 9 8478542.0 Evan Rodrigues R 42 42 60.307545
2020 1 209 PHI MISS 0 -46 33 46 -33 ... D 38 5 8479026.0 Philippe Myers R 38 38 54.203321
Current code:
dft['count'] = df.groupby(['shooterName', 'event'])['event'].agg(['count'])
dft
Current Output:
shooterName event count
A.J. Greer GOAL 1
MISS 6
SHOT 29
Aaron Downey GOAL 1
MISS 4
SHOT 35
Zenon Konopka GOAL 8
MISS 57
SHOT 176
Desired Output:
shooterName event count %totalshooterNameevents
A.J. Greer GOAL 1 .0277
MISS 6 .1666
SHOT 29 .805
Aaron Downey GOAL 1 .025
MISS 4 .1
SHOT 35 .875
Zenon Konopka GOAL 8 .0331
MISS 57 .236
SHOT 176 .7302
Something similar to this. My end goal is to calculate each 'event' value as a percentage of the total events per 'shooterName'. Above I added a column '%totalshooterNameevents', which is simply the count of 'goal', 'shot', or 'miss' divided by the sum of the 'goal', 'shot', and 'miss' counts for that 'shooterName'.

Update
Try:
dft = df.groupby(['shooterName', 'event'])['event'].agg(['count']).reset_index()
# transform('sum') keeps the original row index, so the division aligns row by row
dft['%total'] = dft['count'] / dft.groupby('shooterName')['count'].transform('sum')
print(dft)
# Output
shooterName event count %total
0 A.J. Greer GOAL 1 0.027778
1 A.J. Greer MISS 6 0.166667
2 A.J. Greer SHOT 29 0.805556
3 Aaron Downey GOAL 1 0.025000
4 Aaron Downey MISS 4 0.100000
5 Aaron Downey SHOT 35 0.875000
6 Zenon Konopka GOAL 8 0.033195
7 Zenon Konopka MISS 57 0.236515
8 Zenon Konopka SHOT 176 0.730290
Without a sample, it's difficult to guess what you want. Try:
import pandas as pd
import numpy as np
# Setup a Minimal Reproducible Example
np.random.seed(2021)
df = pd.DataFrame({'shooterName': np.random.choice(list('AB'), 20),
'event': np.random.choice(['shot', 'goal', 'miss'], 20)})
# Start from an empty dataframe indexed by shooter name
dft = pd.DataFrame(index=df['shooterName'].unique())
# Count events per shooter, then divide each event type's count by that total
grp = df.groupby('shooterName')
dft['count'] = grp['event'].count()
dft = dft.join(grp['event'].value_counts().unstack('event')
.div(dft['count'], axis=0))
Output:
>>> dft
count goal miss shot
A 12 0.416667 0.250 0.333333
B 8 0.500000 0.375 0.125000
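As a possible shortcut for the same per-shooter shares, pd.crosstab with normalize='index' divides each row by its own total. A minimal sketch, using a few made-up rows (not the asker's data) so it runs on its own:

```python
import pandas as pd

# Hypothetical toy rows, only to make the sketch self-contained.
df = pd.DataFrame({
    "shooterName": ["A.J. Greer"] * 4 + ["Aaron Downey"] * 2,
    "event": ["SHOT", "SHOT", "MISS", "GOAL", "SHOT", "MISS"],
})

# One row per shooter, one column per event type;
# normalize="index" turns the counts in each row into fractions of that row's total.
shares = pd.crosstab(df["shooterName"], df["event"], normalize="index")
print(shares)
```

Each row of shares sums to 1, so the columns can be plotted or compared across shooters directly.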

Compare 2 sets of 2 columns in excel with a lookup

Hi all, I need help with the following formula. I have looked up ways to compare different datasets in Excel, but this particular case is a little different from the examples I've seen. Say I have the following data set:
A  B  C  D  E    F
AB 75 AB 75 Bob
AC 56 AC 68 Fre
AB 75 AB 75 Jill
I need a formula that compares columns A:B with columns C:D and prints the value from E into F.
For example, the result for the data above would look like this, since A:B and C:D are equal, so the name is printed:
A  B  C  D  E    F
AB 75 AB 75 Bob  Bob, Jill
AC 56 AC 68 Fre  Fre
AB 75 AB 75 Jill

Give the formula below a try.
=TEXTJOIN(", ",TRUE,FILTER($E$1:$E$3,MMULT(($A$1:$B$3=A1:B1)*($C$1:$D$3=C1:D1),TRANSPOSE({1,1}))))
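If the data ever moves out of Excel, the same idea (collect every name in E whose A:D values match the current row) can be sketched in pandas. This is my reading of the desired output, not a translation of the formula, and the frame below simply retypes the three sample rows.

```python
import pandas as pd

# The three sample rows from the question.
df = pd.DataFrame({
    "A": ["AB", "AC", "AB"],
    "B": [75, 56, 75],
    "C": ["AB", "AC", "AB"],
    "D": [75, 68, 75],
    "E": ["Bob", "Fre", "Jill"],
})

# Rows that share the same A, B, C and D values form one group;
# every row in a group receives the comma-joined names of that group.
df["F"] = df.groupby(["A", "B", "C", "D"])["E"].transform(
    lambda names: ", ".join(names)
)
print(df)
```

Note that this fills F on every matching row, so the third row also gets "Bob, Jill", whereas the question's example leaves it blank.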

Grouping data based on month-year in pandas and then dropping all entries except the latest one- Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write code that groups the data by month-year, keeps the entry with the latest date in each month-year, and drops the rest. The data run up to the year 2020.
So far I have only been able to get the count by month-year. I have not been able to write code that groups the data by month-year and indicator and gives the correct results.

Use Series.dt.to_period to get monthly periods, find the index of the latest date in each group with SeriesGroupBy.idxmax, and then pass those index labels to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('M'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('M'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90

convert binary variable into unique group

I have a binary variable (Var C) that identifies when another variable (Var B) is above or below a different variable (Var A). There is typically a series of the same value for Var C. I'd like to make a new variable (Var D) that represents the unique group of data for the time in between switches.
Hope this helps
VarA VarB VarC VarD
30 28 1 1
32 28 1 1
33 30 1 1
32 32 1 1
34 33 1 1
35 36 0 2
37 38 0 2
38 39 0 2
39 39 0 2
40 39 1 3
38 37 1 3
37 36 1 3
35 33 1 3
Thanks in advance for any help

If your data is in columns A through C you can assign value 1 to cell D2 and use the following formula for the rest of the rows:
=IF(C3<>C2,D2+1,D2)

In D2 enter 1
In D3 enter:
=IF(C3=C2,D2,D2+1)
and copy downward.
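The same "start a new group whenever the value changes" logic can be sketched in pandas by comparing each value with the previous one and taking a running sum of the changes. The series below retypes the VarC column from the question; the names var_c and var_d are only for illustration.

```python
import pandas as pd

# VarC from the question's sample.
var_c = pd.Series([1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1])

# True wherever the value differs from the row above (the first row counts as a change),
# and the cumulative sum of those flags numbers the runs 1, 2, 3, ...
var_d = (var_c != var_c.shift()).cumsum()
print(var_d.tolist())
```

This mirrors the Excel formula: the counter goes up by one on each switch and stays the same otherwise.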

Excel - Sorting based on metric values

I have a dataset which looks like:
Product Metrics C1 C2 C3
A1 Q1 20 30 10
Q2 213123 2312 32123
Q3 454 65 45
Q4 3 4 6
A2 Q1 10 5 1
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
A3 Q1 18 6 3
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
Now I want to sort the products (A1, A2, A3) by their Q1 metric, from smallest to largest, keeping each product's four rows together. The final dataset should look like this:
Product Metrics C1 C2 C3
A2 Q1 10 5 1
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
A3 Q1 18 6 3
Q2 123 13 23
Q3 454 65 45
Q4 3 4 6
A1 Q1 20 30 10
Q2 213123 2312 32123
Q3 454 65 45
Q4 3 4 6
hope this gives a clear picture. Thanks in advance guys

The way I would probably do it is transpose your columns and rows so that you have columns for Q1, Q2, Q3, Q4.
Like this:
Product Metrics Q1 Q2 Q3 Q4
A1 C1 20 213123 454 3
A1 C2 30 2312 65 4
A1 C3 10 32123 45 6
A2 C1 10 123 454 3
A2 C2 5 13 65 4
A2 C3 1 23 45 6
Then you can sort by Q1 using Data>Sort & Filter

CBRF23 already pointed in the right direction but I believe you have to go even a little bit further and flatten each product related sub-array into a single row like
A | B C D | E F G | H I J | K L M
---| Q1 --------| Q2 ------------ | Q3 ------- | Q4 -------
Pr | C1 C2 C3 | C1 C2 C3 | C1 C2 C3 | C1 C2 C3
A1 | 20 30 10 | 213123 2312 32123 | 454 65 45 | 3 4 6
A2 | 10 5 1 | 123 13 23 | 454 65 45 | 3 4 6
A3 | 18 6 3 | 123 13 23 | 454 65 45 | 3 4 6
(The first row just shows the Excel columns, second row the flattened Q1,Q2,Q3 and Q4 sections and third row the sub-headers for each column)
Now you can safely sort by column B. In case you want to sort by the sum of all Q1 metrics you could introduce another column N being the sum of B,C and D and use that for sorting.
Update:
To get your desired output format back there are basically two possibilities:
If the number of records is known and fixed, you can set up a "results" sheet in your Excel workbook with a list of small "sub-tables". The fields of each sub-table then directly reference the "transposed" fields in a line of the sorted master results array.
If the number of results is variable, you will have to construct or reconstruct the results sheet mentioned above using a suitable VBA script. The VBA-generated sheet can of course also contain the sorted values directly rather than references to the values in the sorted master array.
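Outside Excel, the block sort can also be sketched in pandas without reshaping the table. The sketch assumes the Product label is filled in on every row and that "sort by Q1" means the C1 value of the Q1 row; the question does not say which column to use, and C2 or C3 would give the same order for this sample.

```python
import pandas as pd

# Sample data from the question, with Product repeated on every row.
rows = [
    ("A1", "Q1", 20, 30, 10), ("A1", "Q2", 213123, 2312, 32123),
    ("A1", "Q3", 454, 65, 45), ("A1", "Q4", 3, 4, 6),
    ("A2", "Q1", 10, 5, 1), ("A2", "Q2", 123, 13, 23),
    ("A2", "Q3", 454, 65, 45), ("A2", "Q4", 3, 4, 6),
    ("A3", "Q1", 18, 6, 3), ("A3", "Q2", 123, 13, 23),
    ("A3", "Q3", 454, 65, 45), ("A3", "Q4", 3, 4, 6),
]
df = pd.DataFrame(rows, columns=["Product", "Metrics", "C1", "C2", "C3"])

# Look up each product's Q1 value in C1 (assumed sort key) and attach it to all its rows.
q1_key = df[df["Metrics"] == "Q1"].set_index("Product")["C1"]
df["_key"] = df["Product"].map(q1_key)

# A stable sort keeps Q1..Q4 in their original order inside each product block.
out = (
    df.sort_values("_key", kind="mergesort")
      .drop(columns="_key")
      .reset_index(drop=True)
)
print(out)
```

If two products could share the same Q1 value, adding Product as a second sort key would keep their blocks from interleaving.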
