Grouping on multiple variables in R - excel

I'm a power Excel pivot-table user who is forcing himself to learn R. I know exactly how to do this analysis in Excel, but can't figure out the right way to code it in R.
I'm trying to group user data by 2 different variables, while grouping the variables into ranges (or bins), then summarizing other variables.
Here is what the data looks like:
userid  visits  posts  revenue
     1      25      0       25
     2       2      2        0
     3      86      7        8
     4     128     24       94
     5      30      5       18
     …       …      …        …
280000      80     10      100
280001      42      4       25
280002      31      8       17
Here is what I am trying to get the output to look like:
VisitRange  PostRange  # of Users  Total Revenue  Average Revenue
0           0          X           Y              Z
1-10        0          X           Y              Z
11-20       0          X           Y              Z
21-30       0          X           Y              Z
31-40       0          X           Y              Z
41-50       0          X           Y              Z
> 50        0          X           Y              Z
0           1-10       X           Y              Z
1-10        1-10       X           Y              Z
11-20       1-10       X           Y              Z
21-30       1-10       X           Y              Z
31-40       1-10       X           Y              Z
41-50       1-10       X           Y              Z
> 50        1-10       X           Y              Z
I want to group visits and posts into bins of 10 up to 50, then group anything higher than 50 into a single '> 50' bucket.
I've looked at tapply and ddply as ways to accomplish this, but I don't think they will work the way I'm expecting, though I could be wrong.
Lastly, I know I could do this in SQL using an if/then statement to identify the range of visits and the range of posts (for example, if visits are between 1 and 10, then '1-10'), and then just group by visit range and post range, but my goal here is to start forcing myself to use R. Maybe R isn't the right tool here, but I think it is…
All help would be appreciated. Thanks in advance.

The idiom in the plyr package, and ddply in particular, is very similar to pivot tables in Excel.
In your example, the only thing you need to do is cut your grouping variables into the desired breaks before passing the data to ddply. Here is an example:
First, create some sample data:
set.seed(1)
dat <- data.frame(
  userid  = 1:500,
  visits  = sample(0:50, 500, replace=TRUE),
  posts   = sample(0:50, 500, replace=TRUE),
  revenue = sample(1:100, replace=TRUE)
)
Now, use cut to divide your grouping variables into the desired ranges:
dat$PostRange <- cut(dat$posts, breaks=seq(0, 50, 10), include.lowest=TRUE)
dat$VisitRange <- cut(dat$visits, breaks=seq(0, 50, 10), include.lowest=TRUE)
Finally, use ddply with summarise:
library(plyr)
ddply(dat, .(VisitRange, PostRange),
      summarise,
      Users = length(userid),
      `Total Revenue` = sum(revenue),
      `Average Revenue` = mean(revenue))
The results:
VisitRange PostRange Users Total Revenue Average Revenue
1 [0,10] [0,10] 23 1318 57.30435
2 [0,10] (10,20] 23 1136 49.39130
3 [0,10] (20,30] 28 1499 53.53571
4 [0,10] (30,40] 20 923 46.15000
5 [0,10] (40,50] 14 826 59.00000
6 (10,20] [0,10] 23 1227 53.34783
7 (10,20] (10,20] 17 642 37.76471
8 (10,20] (20,30] 20 888 44.40000
9 (10,20] (30,40] 15 622 41.46667
10 (10,20] (40,50] 21 968 46.09524
11 (20,30] [0,10] 23 1226 53.30435
12 (20,30] (10,20] 19 1021 53.73684
13 (20,30] (20,30] 23 1380 60.00000
14 (20,30] (30,40] 8 313 39.12500
15 (20,30] (40,50] 19 710 37.36842
16 (30,40] [0,10] 18 782 43.44444
17 (30,40] (10,20] 25 1308 52.32000
18 (30,40] (20,30] 14 553 39.50000
19 (30,40] (30,40] 26 1131 43.50000
20 (30,40] (40,50] 20 1295 64.75000
21 (40,50] [0,10] 20 958 47.90000
22 (40,50] (10,20] 21 1168 55.61905
23 (40,50] (20,30] 20 1118 55.90000
24 (40,50] (30,40] 20 1009 50.45000
25 (40,50] (40,50] 20 934 46.70000

Related

exception handling attempt in pandas

I am having difficulty creating two columns, "Home Score" and "Away Score", from the Wikipedia table I am trying to parse.
I tried the following script with two try-except-else statements to see if that would work.
import pandas as pd

test_matches = pd.read_html('https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results')
test_matches = test_matches[1]
test_matches['Year'] = test_matches['Date'].str[-4:].apply(pd.to_numeric)
test_matches_worst = test_matches[(test_matches['Winner'] != 'Wales') & (test_matches['Year'] >= 2007) & (test_matches['Competition'].str.contains('Nations'))]
try:
    test_matches_worst['Home Score'] = test_matches_worst['Score'].str.split("–").str[0].apply(pd.to_numeric)
except:
    print("let's try again")
else:
    test_matches_worst['Home Score'] = test_matches_worst['Score'].str.split("-").str[0].apply(pd.to_numeric)
try:
    test_matches_worst['Away Score'] = test_matches_worst['Score'].str.split("–").str[1].apply(pd.to_numeric)
except:
    print("let's try again")
else:
    test_matches_worst['Away Score'] = test_matches_worst['Score'].str.split("-").str[1].apply(pd.to_numeric)
test_matches_worst['Margin'] = (test_matches_worst['Home Score'] - test_matches_worst['Away Score']).abs()
test_matches_worst.sort_values('Margin', ascending=False).reset_index(drop = True)#.head(20)
However, I receive a KeyError, and "Home Score" is not displayed in the dataframe when I shorten the code. What is the best way to handle this particular table and to generate the columns that I want? Any assistance would be greatly appreciated. Thanks in advance.
The problem with the data you collected is the hyphen/dash. Except for the last row, all score separators are the en dash (U+2013), not the hyphen (U+002D):
sep = r'[-\u2013]'
# df is test_matches_worst
df[['Home Score','Away Score']] = df['Score'].str.split(sep, expand=True).astype(int)
df['Margin'] = df['Home Score'].sub(df['Away Score']).abs()
Output:
>>> df[['Score', 'Home Score', 'Away Score', 'Margin']]
Score Home Score Away Score Margin
565 9–19 9 19 10
566 21–9 21 9 12
567 32–21 32 21 11
568 23–20 23 20 3
593 21–16 21 16 5
595 15–17 15 17 2
602 30–17 30 17 13
604 20–26 20 26 6
605 27–12 27 12 15
614 19–26 19 26 7
618 28–9 28 9 19
644 22–30 22 30 8
656 26–3 26 3 23
658 29–18 29 18 11
666 16–21 16 21 5
679 16–16 16 16 0
682 25–21 25 21 4
693 16–21 16 21 5
694 29–13 29 13 16
696 20–18 20 18 2
704 12–6 12 6 6
705 37–27 37 27 10
732 24–14 24 14 10
733 23–27 23 27 4
734 33–30 33 30 3
736 10–14 10 14 4
737 32–9 32 9 23
739 13–24 13 24 11
745 32–30 32 30 2
753 29-7 29 7 22
Note: you will probably receive a SettingWithCopyWarning. To solve it, take an explicit copy: test_matches = test_matches[1].copy()
Bonus
Pandas functions like to_datetime, to_timedelta or to_numeric can take a Series as a parameter, so you can avoid apply:
test_matches['Year'] = pd.to_numeric(test_matches['Date'].str[-4:])
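Putting the pieces together, here is a minimal end-to-end sketch of the cleaned-up approach (it assumes the layout of the Wikipedia table is unchanged and that the Date, Score, Winner and Competition columns are as shown in the question):

import pandas as pd

# Read the page tables and take an explicit copy to avoid SettingWithCopyWarning
tables = pd.read_html('https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results')
test_matches = tables[1].copy()

# Year is the last four characters of the Date column
test_matches['Year'] = pd.to_numeric(test_matches['Date'].str[-4:])

# Matches Wales did not win, in a 'Nations' competition, from 2007 onwards
worst = test_matches[(test_matches['Winner'] != 'Wales')
                     & (test_matches['Year'] >= 2007)
                     & (test_matches['Competition'].str.contains('Nations'))].copy()

# Split the score on either a hyphen or an en dash, then compute the margin
worst[['Home Score', 'Away Score']] = worst['Score'].str.split(r'[-\u2013]', expand=True).astype(int)
worst['Margin'] = (worst['Home Score'] - worst['Away Score']).abs()

print(worst.sort_values('Margin', ascending=False).reset_index(drop=True))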

Efficient way to populate missing indexes from pandas group by

I grouped a column in a pandas dataframe by the number of occurrences of an event per hour of the day like so:
df_sep.hour.groupby(df_sep.time.dt.hour).size()
Which gives the following result:
time
2 31
3 6
4 7
5 4
6 38
7 9
8 5
9 31
10 8
11 2
12 5
13 30
14 1
15 1
16 28
18 1
20 4
21 29
Name: hour, dtype: int64
For plotting, I would like to complete the series for every hour of the day; for example, there are no occurrences at midnight (hour 0). So for every missing hour, I would like to create that index and set the corresponding value to zero.
To solve this I created two lists (x and y) using the following loop, but it feels a bit hacky... is there a better way to solve this?
x = []
y = []
for i in range(24):
    if i not in df_sep.hour.groupby(df_sep.time.dt.hour).size().index:
        x.append(i)
        y.append(0)
    else:
        x.append(i)
        y.append(df_sep.hour.groupby(df_sep.time.dt.hour).size().loc[i])
result:
for i, j in zip(x, y):
    print(i, j)
0 0
1 0
2 31
3 6
4 7
5 4
6 38
7 9
8 5
9 31
10 8
11 2
12 5
13 30
14 1
15 1
16 28
17 0
18 1
19 0
20 4
21 29
22 0
23 0
Use Series.reindex with range(24):
df_sep.hour.groupby(df_sep.time.dt.hour).size().reindex(range(24), fill_value=0)
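As a quick, self-contained illustration with made-up counts (only the reindex call matters here):

import pandas as pd

counts = pd.Series({2: 31, 3: 6, 21: 29}, name='hour')  # toy hourly counts with gaps
full = counts.reindex(range(24), fill_value=0)           # hours 0-23, missing hours become 0
print(full)                                              # ready to plot, e.g. full.plot(kind='bar')

Because reindex returns a new Series aligned to the full hourly index, there is no need to build the x and y lists by hand.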

Apply z-score across all attributes by country

I'm trying to clean up a dataset that has data on every country in the world from 2000-2015. The population data by year is quite bad; I want to assign a z-score to each country's population data by year so I can see which data points to drop as outliers. How would I do this? I'm thinking I need to use groupby(), but I'm not sure how to deploy it.
I'm working with this WHO Kaggle dataset: https://www.kaggle.com/kumarajarshi/life-expectancy-who/data#
The data generally looks like this:
[example screenshot of the data omitted]
Maybe something like this might work:
import numpy as np, pandas as pd
l1 = ['a'] * 5 + ['b'] * 10 + ['c'] * 8
l2 = list(np.random.randint(10,20,size=5)) + list(np.random.randint(100,150, size=10)) + list(np.random.randint(75,100, size=8))
df = pd.DataFrame({'cat':l1, 'values':l2}) #creating a dummy dataframe
df
cat values
0 a 18
1 a 17
2 a 11
3 a 13
4 a 11
5 b 102
6 b 103
7 b 119
8 b 113
9 b 100
10 b 113
11 b 102
12 b 108
13 b 128
14 b 126
15 c 75
16 c 96
17 c 81
18 c 90
19 c 80
20 c 95
21 c 96
22 c 86
df['z-score'] = df.groupby(['cat'])['values'].apply(lambda x: (x - x.mean())/x.std())
df
cat values z-score
0 a 18 1.206045
1 a 17 0.904534
2 a 11 -0.904534
3 a 13 -0.301511
4 a 11 -0.904534
5 b 102 -0.919587
6 b 103 -0.821759
7 b 119 0.743496
8 b 113 0.156525
9 b 100 -1.115244
10 b 113 0.156525
11 b 102 -0.919587
12 b 108 -0.332617
13 b 128 1.623951
14 b 126 1.428295
15 c 75 -1.520176
16 c 96 1.059516
17 c 81 -0.783121
18 c 90 0.322461
19 c 80 -0.905963
20 c 95 0.936674
21 c 96 1.059516
22 c 86 -0.168908
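For the WHO data itself, the same idea is often written with transform, grouping by country. This is only a sketch; the filename and the Country/Population/Year column names are assumptions about the Kaggle file, so adjust them to match your copy:

import pandas as pd

df = pd.read_csv('Life Expectancy Data.csv')  # hypothetical filename for the Kaggle download

# Z-score of each country's yearly population against that country's own mean and std
grp = df.groupby('Country')['Population']
df['pop_zscore'] = (df['Population'] - grp.transform('mean')) / grp.transform('std')

# Rows far from their country's mean (e.g. |z| > 3) are candidate outliers to drop
outliers = df[df['pop_zscore'].abs() > 3]
print(outliers[['Country', 'Year', 'Population', 'pop_zscore']])

transform keeps the result aligned with the original rows, so the new column can be assigned back directly.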

create matching label with periodic repeating values

I have data like the data_df2 sample shown below. The code below creates the label column by comparing the Cleaned value of each record to the record before it, and then either giving it the same letter (if the values match) or a new one. The problem is that I would like the letters chosen for the label column to start over with every new label_set_id, so the first label value for label_set_id=2 would be A. The label_set_id goes up by 1 every 20 records. Can anyone suggest how I can modify the code below to accomplish this? Or is there a slicker way with pandas, say using the apply function? This code also runs rather slowly.
code:
data_df2['label'] = ''
c = 65
data_df2.label[0] = chr(c)
c = c + 1
for i in range(1, len(data_df2)):
    if data_df2.loc[i, 'Cleaned'] == data_df2.loc[i-1, 'Cleaned']:
        data_df2.label[i] = data_df2.label[i-1]
    else:
        data_df2.label[i] = chr(c)
        c = c + 1
input data:
print(data_df2[:30])
id Source \
0 1 ,O-PEN 2.0
1 2 .7 FRAM BLOWER - BROTHERLY LOVE MECHANIC
2 3 #BEEZLEEXTRACTS
3 4 #CALISIFTCO_
4 5 #CALISIFTCO_ X #_ZKITTLEZ_
5 6 #CALISIFTCO_ X #WONDERBRETT
6 7 #CALISIFTCO_ X #WONDERBRETT_
7 8 #DNA_GENETICS
8 9 #EDENEXTRACTS_CA
9 10 #EDENEXTRACTS_CA X #CALISIFTCO_
10 11 #FULLFLAVAEXTRACT
11 12 #GGSTRAINS
12 13 #SHERBINSKI415
13 14 #STR8MECHANIC X #ICEDOUTEXTRACTS
14 15 #STR8MECHANIC X #REZHEADS215
15 16 [SS] 710 LABS
16 17 [SS] ABSOLUTE EXTRACTS
17 18 [SS] BIG PETE'S
18 19 [SS] BLOOM FARMS
19 20 [SS] BLUE RIVER
20 21 [SS] BRITE LABS
21 22 [SS] BROTHERLY LOVE
22 23 [SS] BROTHERLY LOVE [3 PACK]
23 24 [SS] CALIFORNIA DREAMIN
24 25 [SS] DIME BAG
25 26 [SS] EDEN INFUSIONS
26 27 [SS] EEL RIVER
27 28 [SS] GANJA GOLD
28 29 [SS] GLOWING BUDDHA
29 30 [SS] JETTY
Cleaned label_set_id label
0 O.PEN VAPE 1 A
1 BROTHERLY LOVE 1 B
2 BEEZLE EXTRACTS 1 C
3 CALI SIFT CO 1 D
4 CALI SIFT CO 1 D
5 #CALISIFTCO_ X #WONDERBRETT_ 1 E
6 #CALISIFTCO_ X #WONDERBRETT_ 1 E
7 DNA GENETICS 1 F
8 EDEN 1 G
9 CALI SIFT CO 1 H
10 FLAV RX 1 I
11 GG STRAINS 1 J
12 SHERBINSKI 1 K
13 STR8 MECHANIC 1 L
14 STR8 MECHANIC 1 L
15 710 LABS 1 M
16 ABSOLUTE XTRACTS 1 N
17 BIG PETE'S TREATS 1 O
18 BLOOM FARMS 1 P
19 BLUE RIVER 1 Q
20 BRITE LABS 2 R
21 BROTHERLY LOVE 2 S
22 BROTHERLY LOVE 2 S
23 CALIFORNIA DREAMIN 2 T
24 DIME BAG 2 U
25 EDEN 2 V
26 EEL RIVER 2 W
27 GANJA GOLD 2 X
28 GLOWING BUDDHA 2 Y
29 JETTY EXTRACTS 2 Z
output data:
id Source \
0 1 ,O-PEN 2.0
1 2 .7 FRAM BLOWER - BROTHERLY LOVE MECHANIC
2 3 #BEEZLEEXTRACTS
3 4 #CALISIFTCO_
4 5 #CALISIFTCO_ X #_ZKITTLEZ_
5 6 #CALISIFTCO_ X #WONDERBRETT
6 7 #CALISIFTCO_ X #WONDERBRETT_
7 8 #DNA_GENETICS
8 9 #EDENEXTRACTS_CA
9 10 #EDENEXTRACTS_CA X #CALISIFTCO_
10 11 #FULLFLAVAEXTRACT
11 12 #GGSTRAINS
12 13 #SHERBINSKI415
13 14 #STR8MECHANIC X #ICEDOUTEXTRACTS
14 15 #STR8MECHANIC X #REZHEADS215
15 16 [SS] 710 LABS
16 17 [SS] ABSOLUTE EXTRACTS
17 18 [SS] BIG PETE'S
18 19 [SS] BLOOM FARMS
19 20 [SS] BLUE RIVER
20 21 [SS] BRITE LABS
21 22 [SS] BROTHERLY LOVE
22 23 [SS] BROTHERLY LOVE [3 PACK]
23 24 [SS] CALIFORNIA DREAMIN
24 25 [SS] DIME BAG
25 26 [SS] EDEN INFUSIONS
26 27 [SS] EEL RIVER
27 28 [SS] GANJA GOLD
28 29 [SS] GLOWING BUDDHA
29 30 [SS] JETTY
Cleaned label_set_id label
0 O.PEN VAPE 1 A
1 BROTHERLY LOVE 1 B
2 BEEZLE EXTRACTS 1 C
3 CALI SIFT CO 1 D
4 CALI SIFT CO 1 D
5 #CALISIFTCO_ X #WONDERBRETT_ 1 E
6 #CALISIFTCO_ X #WONDERBRETT_ 1 E
7 DNA GENETICS 1 F
8 EDEN 1 G
9 CALI SIFT CO 1 H
10 FLAV RX 1 I
11 GG STRAINS 1 J
12 SHERBINSKI 1 K
13 STR8 MECHANIC 1 L
14 STR8 MECHANIC 1 L
15 710 LABS 1 M
16 ABSOLUTE XTRACTS 1 N
17 BIG PETE'S TREATS 1 O
18 BLOOM FARMS 1 P
19 BLUE RIVER 1 Q
20 BRITE LABS 2 A
21 BROTHERLY LOVE 2 B
22 BROTHERLY LOVE 2 B
23 CALIFORNIA DREAMIN 2 C
24 DIME BAG 2 D
25 EDEN 2 E
26 EEL RIVER 2 F
27 GANJA GOLD 2 G
28 GLOWING BUDDHA 2 H
29 JETTY EXTRACTS 2 I
IIUC, you can use groupby on label_set_id, check where two consecutive rows differ with shift, and use cumsum to get an incremental value per group. Add 64 so that mapping chr over the result starts at 'A'.
# dummy example
df = pd.DataFrame({'Cleaned': list('abbcddeffijkllmn'),
                   'label_set_id': [1]*8 + [2]*8})

# create the label column
df['label'] = list(map(chr, df.groupby('label_set_id')['Cleaned']
                            .apply(lambda x: x.ne(x.shift()).cumsum()) + 64))
print(df)
Cleaned label_set_id label
0 a 1 A
1 b 1 B
2 b 1 B #same Cleaned as previous row
3 c 1 C
4 d 1 D
5 d 1 D
6 e 1 E
7 f 1 F
8 f 2 A #restart at A for new label_set_id
9 i 2 B
10 j 2 C
11 k 2 D
12 l 2 E
13 l 2 E
14 m 2 F
15 n 2 G
EDIT: if the data is ordered by label_set_id, you can do it without groupby:
df['label'] = df['Cleaned'].ne(df['Cleaned'].shift()).cumsum()
df['label'] = list(map(chr, df['label']
                        - df['label'].where(df['label_set_id'].ne(df['label_set_id'].shift()))
                                     .ffill().astype(int)
                        + 65))
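To see what the no-groupby version is doing, here is a small traced sketch with made-up values (it assumes, as the EDIT says, that the rows are already ordered by label_set_id):

import pandas as pd

df = pd.DataFrame({'Cleaned':      list('abbcdde'),
                   'label_set_id': [1, 1, 1, 1, 2, 2, 2]})

# Global run counter over the whole frame: 1, 2, 2, 3, 4, 4, 5
run = df['Cleaned'].ne(df['Cleaned'].shift()).cumsum()

# Run counter at the first row of each label_set_id, forward-filled: 1, 1, 1, 1, 4, 4, 4
start = run.where(df['label_set_id'].ne(df['label_set_id'].shift())).ffill().astype(int)

# Subtracting each group's starting counter makes every group restart at 'A'
df['label'] = [chr(65 + n) for n in (run - start)]
print(df)  # labels: A B B C A A B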

R: Reversing the data in a time series object

I figured out a way to backcast (i.e. predict the past) with a time series. Now I'm just struggling with the programming in R.
I would like to reverse the time series data so that I can forecast the past. How do I do this?
Say the original time series object looks like this:
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2008 116  99 115 101 112 120 120 110 143 136 147 142
2009 117 114 133 134 139 147 147 131 125 143 136 129
I want it to look like this for the 'backcasting':
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2008 129 136 143 125 131 147 147 139 134 133 114 117
2009 142 147 136 143 110 120 120 112 101 115  99 116
Note: I didn't forget to change the years. I am basically mirroring/reversing the data while keeping the years, and will then forecast.
I hope this can be done in R? Or should I export and do it in Excel somehow?
Try this:
tt <- ts(1:24, start = 2008, freq = 12)
tt[] <- rev(tt)
ADDED: This also works and does not modify tt:
replace(tt, TRUE, rev(tt))
You can just coerce the matrix to a vector, reverse it, and make it a matrix again. Here's an example:
mat <- matrix(seq(24),nrow=2,byrow=TRUE)
> mat
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 1 2 3 4 5 6 7 8 9 10 11 12
[2,] 13 14 15 16 17 18 19 20 21 22 23 24
> matrix( rev(mat), nrow=nrow(mat) )
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] 24 23 22 21 20 19 18 17 16 15 14 13
[2,] 12 11 10 9 8 7 6 5 4 3 2 1
I found this post by Hyndman at http://www.r-bloggers.com/backcasting-in-r/ and am basically pasting in his solution, which in my opinion provides a complete answer to your question.
library(forecast)
x <- WWWusage
h <- 20
f <- frequency(x)
# Reverse time
revx <- ts(rev(x), frequency=f)
# Forecast
fc <- forecast(auto.arima(revx), h)
plot(fc)
# Reverse time again
fc$mean <- ts(rev(fc$mean),end=tsp(x)[1] - 1/f, frequency=f)
fc$upper <- fc$upper[h:1,]
fc$lower <- fc$lower[h:1,]
fc$x <- x
# Plot result
plot(fc, xlim=c(tsp(x)[1]-h/f, tsp(x)[2]))
