R: Reversing the data in a time series object - excel

I figured out a way to backcast (i.e., predict the past) with a time series. Now I'm just struggling with the programming in R.
I would like to reverse the time series data so that I can forecast the past. How do I do this?
Say the original time series object looks like this:
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2008 116  99 115 101 112 120 120 110 143 136 147 142
2009 117 114 133 134 139 147 147 131 125 143 136 129
I want it to look like this for the 'backcasting':
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
2008 129 136 143 125 131 147 147 139 134 133 114 117
2009 142 147 136 143 110 120 120 112 101 115  99 116
Note, I didn't forget to change the years - I am basically mirroring/reversing the data and keeping the years, then going to forecast.
I hope this can be done in R? Or should I export and do it in Excel somehow?

Try this:
tt <- ts(1:24, start = 2008, freq = 12)
tt[] <- rev(tt)
Added: this also works and does not modify tt:
replace(tt, TRUE, rev(tt))
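For readers who want to check the mirrored layout outside R, here is a minimal Python sketch of the same idea: reverse only the values and keep the period labels where they are. The `labels` list and the dictionary layout are assumptions made for illustration; they are not part of the R answer.

```python
# Monthly values from the question, in chronological order (Jan 2008 .. Dec 2009).
values = [116, 99, 115, 101, 112, 120, 120, 110, 143, 136, 147, 142,
          117, 114, 133, 134, 139, 147, 147, 131, 125, 143, 136, 129]

# Period labels stay fixed; `labels` is a hypothetical helper list of (year, month).
labels = [(year, month) for year in (2008, 2009) for month in range(1, 13)]

# Reverse only the values, then pair them with the unchanged labels.
reversed_series = dict(zip(labels, reversed(values)))

print(reversed_series[(2008, 1)])   # value now stored under Jan 2008
print(reversed_series[(2009, 12)])  # value now stored under Dec 2009
```

This mirrors what `tt[] <- rev(tt)` does in R: the time attributes are untouched and only the data are flipped.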

You can just coerce the matrix to a vector, reverse it, and make it a matrix again. Here's an example:
> mat <- matrix(seq(24),nrow=2,byrow=TRUE)
> mat
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,]    1    2    3    4    5    6    7    8    9    10    11    12
[2,]   13   14   15   16   17   18   19   20   21    22    23    24
> matrix( rev(mat), nrow=nrow(mat) )
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,]   24   23   22   21   20   19   18   17   16    15    14    13
[2,]   12   11   10    9    8    7    6    5    4     3     2     1

I found this post by Hyndman at http://www.r-bloggers.com/backcasting-in-r/ and am basically pasting in his solution, which in my opinion provides a complete answer to your question.
library(forecast)
x <- WWWusage
h <- 20
f <- frequency(x)
# Reverse time
revx <- ts(rev(x), frequency=f)
# Forecast
fc <- forecast(auto.arima(revx), h)
plot(fc)
# Reverse time again
fc$mean <- ts(rev(fc$mean),end=tsp(x)[1] - 1/f, frequency=f)
fc$upper <- fc$upper[h:1,]
fc$lower <- fc$lower[h:1,]
fc$x <- x
# Plot result
plot(fc, xlim=c(tsp(x)[1]-h/f, tsp(x)[2]))
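The reverse, forecast, reverse-back pattern in the code above does not depend on the forecasting model. As a rough sketch of just that pattern, the Python below swaps in a naive drift forecaster in place of `auto.arima`; `drift_forecast` and `backcast` are hypothetical helper names, and the drift model is only a stand-in to keep the example self-contained.

```python
def drift_forecast(series, h):
    """Extend a series h steps using its average step (a naive drift model)."""
    step = (series[-1] - series[0]) / (len(series) - 1)
    return [series[-1] + step * k for k in range(1, h + 1)]


def backcast(series, h, forecaster=drift_forecast):
    """Reverse the series, forecast h steps ahead, then reverse the forecasts."""
    reversed_series = series[::-1]                       # reverse time
    future_of_reversed = forecaster(reversed_series, h)  # forecast the "future"
    return future_of_reversed[::-1]                      # back to chronological order


observed = [10, 12, 14, 16]
print(backcast(observed, 2))  # the two values that precede `observed`
```

The returned list is in chronological order, so it can be placed directly in front of the observed series, which is what the `end=tsp(x)[1] - 1/f` argument arranges in the R version.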

Related

exception handling attempt in pandas

I am having difficulty creating two columns, "Home Score" and "Away Score", in the wikipedia table I am trying to parse.
I tried the following script with two try-except-else statements to see if that would work.
test_matches = pd.read_html('https://en.wikipedia.org/wiki/List_of_Wales_national_rugby_union_team_results')
test_matches = test_matches[1]
test_matches['Year'] = test_matches['Date'].str[-4:].apply(pd.to_numeric)
test_matches_worst = test_matches[(test_matches['Winner'] != 'Wales') & (test_matches['Year'] >= 2007) & (test_matches['Competition'].str.contains('Nations'))]
try:
    test_matches_worst['Home Score'] = test_matches_worst['Score'].str.split("–").str[0].apply(pd.to_numeric)
except:
    print("let's try again")
else:
    test_matches_worst['Home Score'] = test_matches_worst['Score'].str.split("-").str[0].apply(pd.to_numeric)
try:
    test_matches_worst['Away Score'] = test_matches_worst['Score'].str.split("–").str[1].apply(pd.to_numeric)
except:
    print("let's try again")
else:
    test_matches_worst['Away Score'] = test_matches_worst['Score'].str.split("-").str[1].apply(pd.to_numeric)
test_matches_worst['Margin'] = (test_matches_worst['Home Score'] - test_matches_worst['Away Score']).abs()
test_matches_worst.sort_values('Margin', ascending=False).reset_index(drop = True)#.head(20)
However, I receive a KeyError, and the "Home Score" column is not displayed in the dataframe when I shorten the code. What is the best way to handle this particular table and to generate the columns that I want? Any assistance on this would be greatly appreciated. Thanks in advance.
The problem with the data you collected is the separator. Except in the last row, every score separator is an en dash (U+2013), not a hyphen (U+002D), so split on either character:
sep = r'[-\u2013]'
# df is test_matches_worst
df[['Home Score','Away Score']] = df['Score'].str.split(sep, expand=True).astype(int)
df['Margin'] = df['Home Score'].sub(df['Away Score']).abs()
Output:
>>> df[['Score', 'Home Score', 'Away Score', 'Margin']]
     Score  Home Score  Away Score  Margin
565   9–19           9          19      10
566   21–9          21           9      12
567  32–21          32          21      11
568  23–20          23          20       3
593  21–16          21          16       5
595  15–17          15          17       2
602  30–17          30          17      13
604  20–26          20          26       6
605  27–12          27          12      15
614  19–26          19          26       7
618   28–9          28           9      19
644  22–30          22          30       8
656   26–3          26           3      23
658  29–18          29          18      11
666  16–21          16          21       5
679  16–16          16          16       0
682  25–21          25          21       4
693  16–21          16          21       5
694  29–13          29          13      16
696  20–18          20          18       2
704   12–6          12           6       6
705  37–27          37          27      10
732  24–14          24          14      10
733  23–27          23          27       4
734  33–30          33          30       3
736  10–14          10          14       4
737   32–9          32           9      23
739  13–24          13          24      11
745  32–30          32          30       2
753   29-7          29           7      22
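Note: you will probably receive a SettingWithCopyWarning, because test_matches_worst is a filtered slice of test_matches.
To avoid it, take an explicit copy of the filtered frame by appending .copy() to the line that creates test_matches_worst.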
Bonus
Pandas functions such as to_datetime, to_timedelta and to_numeric accept a Series as a parameter, so you can avoid apply:
test_matches['Year'] = pd.to_numeric(test_matches['Date'].str[-4:])
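To see the separator issue in isolation, here is a small standard-library sketch that splits a score on either an en dash or a hyphen, using the same character class as the `sep` pattern in the answer. `split_score` is a hypothetical helper name, and the three sample scores are taken from the output table above.

```python
import re

# Same character class as the answer's `sep`: hyphen (U+002D) or en dash (U+2013).
SEP = re.compile(r"[-\u2013]")


def split_score(score):
    """Return (home, away) as integers for a score such as '9–19' or '29-7'."""
    home, away = SEP.split(score.strip())
    return int(home), int(away)


scores = ["9\u201319", "21\u20139", "29-7"]  # two en-dash scores, one hyphen score
parsed = [split_score(s) for s in scores]
margins = [abs(home - away) for home, away in parsed]

print(parsed)
print(margins)
```

Splitting only on "-" would leave the en-dash scores as a single unsplit string, which is why the original try/except attempt never produced both columns.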

Running sums from one column conditional on values in another column

I've got a dataset containing data values associated with times (amongst other categories), and I'd like to add an accumulated value column - that is, the sum of all values up to and including the time. So, taking something like this:
ID YEAR VALUE
0 A 2018 144
1 B 2018 147
2 C 2018 164
3 D 2018 167
4 A 2019 167
5 B 2019 109
6 C 2019 183
7 D 2019 121
8 A 2020 136
9 B 2020 187
10 C 2020 170
11 D 2020 188
and adding a column like this:
ID YEAR VALUE CUMULATIVE_VALUE
0 A 2018 144 144
1 B 2018 147 147
2 C 2018 164 164
3 D 2018 167 167
4 A 2019 167 311
5 B 2019 109 256
6 C 2019 183 347
7 D 2019 121 288
8 A 2020 136 447
9 B 2020 187 443
10 C 2020 170 517
11 D 2020 188 476
Where e.g. in row 7 the CUMULATIVE_VALUE is the sum of the 2 VALUE for ID="D" in years 2018 and 2019 (and not 2020).
I've looked at cumsum() but can't see how I could use it in this specific case so the best I've come up with is this:
import numpy as np
import pandas as pd
np.random.seed(0)
ids=["A","B","C","D"]
years=[2018,2019,2020]
df = pd.DataFrame({"ID": np.tile(ids, 3),
                   "YEAR": np.repeat(years, 4),
                   "VALUE": np.random.randint(100,200,12)})
print(df)
df["CUMULATIVE_VALUE"] = None
for id in ids:
    for year in years:
        df.loc[(df.ID==id) & (df.YEAR==year), "CUMULATIVE_VALUE"] = \
            df[(df.ID==id) & (df.YEAR <= year)].VALUE.sum()
print(df)
but I'm sure there must be a better and more efficient way of doing it. Anyone?
You can use DataFrame.groupby to group by ID and then take the cumulative sum of VALUE within each group:
df['CUMULATIVE_VALUE'] = df.groupby('ID').VALUE.cumsum()
ID YEAR VALUE CUMULATIVE_VALUE
0 A 2018 144 144
1 B 2018 147 147
2 C 2018 164 164
3 D 2018 167 167
4 A 2019 167 311
5 B 2019 109 256
6 C 2019 183 347
7 D 2019 121 288
8 A 2020 136 447
9 B 2020 187 443
10 C 2020 170 517
11 D 2020 188 476
If the years are not sorted, sort first:
df = df.sort_values(['ID','YEAR']).reset_index(drop=True)
df['CUMULATIVE_VALUE'] = df.groupby('ID')['VALUE'].cumsum()
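If you would rather keep the original row order than re-sort the frame, one option is to sort only a temporary view and let pandas align the result back by index. The sketch below assumes pandas is installed and reuses the VALUE numbers from the question; the reversed frame named `unsorted` is a made-up test case, not part of the original data.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "B", "C", "D"] * 3,
    "YEAR": [2018] * 4 + [2019] * 4 + [2020] * 4,
    "VALUE": [144, 147, 164, 167, 167, 109, 183, 121, 136, 187, 170, 188],
})

# Reverse the rows so the years are no longer in ascending order.
unsorted = df.iloc[::-1].reset_index(drop=True)

# Sort a temporary copy by YEAR, accumulate within each ID, and rely on
# index alignment to put each running total back on its original row.
unsorted["CUMULATIVE_VALUE"] = (
    unsorted.sort_values("YEAR").groupby("ID")["VALUE"].cumsum()
)

# Lookup keyed by (ID, YEAR) for easy checking.
result = {
    (row.ID, row.YEAR): row.CUMULATIVE_VALUE
    for row in unsorted.itertuples(index=False)
}
print(unsorted)
```

The assignment works because `cumsum` returns a Series that keeps the index labels of the rows it came from, so the sorted order never leaks into `unsorted`.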

Excel: I want to SUM the top 2 values IF a cell in the row matches a string

I've been trying to figure out how to SUM the top 2 values of an array using SUMPRODUCT, but I also want to add a criterion so that only rows matching a specific string are included. I thought I could combine SUMPRODUCT and SUMIF but I have been unsuccessful.
Position Age ADP Trend Value
QB 23 241 84.2 21
QB 35 185 -37.5 142
QB 27 300 25 19
QB 26 300 25 19
QB 32 300 25 19
RB 22 98 -2.2 1051
RB 24 69 0.3 1929
RB 24 238 6 25
RB 26 300 25 19
RB 26 300 25 19
WR 22 300 25 19
WR 24 300 25 19
WR 26 232 -17 36
WR 25 300 25 19
WR 28 300 25 19
WR 23 9 -4.2 8591
WR 23 178 21.4 161
WR 23 38 8.5 4679
WR 26 222 102.8 53
WR 23 300 25 19
WR 26 300 25 19
TE 26 117 -18.7 617
TE 36 193 -30.3 119
TE 26 199 -22.5 105
TE 24 300 25 19
What I want is to SUM the top two values under the Value column IF the Position = QB.
How can I accomplish this?
Cheers!
Use this array formula:
=SUM(LARGE(IF(A2:A25="QB",E2:E25,""),1),LARGE(IF(A2:A25="QB",E2:E25,""),2))
Press CTRL+SHIFT+ENTER to evaluate the formula as it is an array formula.
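To sanity-check what the array formula should return, the same "filter, then sum the two largest" logic can be sketched in Python. The rows are the QB and RB lines from the question's table (Position and Value columns only), and `top_n_sum` is a hypothetical helper name.

```python
from heapq import nlargest

# (Position, Value) pairs for the QB and RB rows of the question's table.
rows = [("QB", 21), ("QB", 142), ("QB", 19), ("QB", 19), ("QB", 19),
        ("RB", 1051), ("RB", 1929), ("RB", 25), ("RB", 19), ("RB", 19)]


def top_n_sum(rows, position, n=2):
    """Sum the n largest values among rows whose position matches."""
    matching = (value for pos, value in rows if pos == position)
    return sum(nlargest(n, matching))


print(top_n_sum(rows, "QB"))  # 142 + 21
print(top_n_sum(rows, "RB"))  # 1929 + 1051
```

For QB this gives 163, which is the value the Excel formula should show once it has been entered with CTRL+SHIFT+ENTER.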

Plotting in R in Linux Terminal

I have a text file that I converted into a numeric vector:
numbers <- scan("list_of_numbers.txt")
I then put it into a table:
t <- table(numbers)
Which outputs like this:
1 2 3 4 5 6 7 8 9 10 11
621266 496647 436229 394595 353249 305882 253983 199455 147380 102872 67255
12 13 14 15 16 17 18 19 20 21 22
41934 24506 13778 7179 3646 1778 816 436 217 114 74
23 24 25 26 27 28 29 30 31 32 33
49 44 26 21 19 21 20 14 9 17 14
34 35 36 37 38 39 40 41 42 43 44
7 11 9 14 3 5 8 4 4 2 3
45 46 47 55 56 60 62 63 69 70 72
2 1 2 2 2 1 1 1 3 2 1
78 82 85 93 95 114 125 265 331 350
1 1 1 1 1 1 1 1 1 1
How would I plot a line graph with x axis of numbers 1 - 25 and y axis the frequency values of the x axis all in the terminal window?
In addition, how can a plot like this (which is saved as a .pdf file by default) be viewed in the Linux terminal?
Most commands like less, cat, and xdg-open output a bunch of strange unreadable symbols.
You can use fbi, the linux framebuffer imageviewer to open pdf files in the linux console.
A small problem is that it may need root privileges. It also does not seem to run through R using system(); it complains about not being a Linux console. But you can use it in the terminal like this:
sudo fbi Rplots.pdf
As for the plotting part of your question you can just use something like:
plot(t, xlim = c(1, 25))
Hope it helps,
alex
I think it's very convenient to use txtplot::txtplot as follows:
> cat("1 2 3 4 5 6", file = "list_of_numbers.txt", sep = "\n")
> numbers <- scan("list_of_numbers.txt")
Read 6 items
> t <- table(numbers)
> txtplot(t)
You can install it just by this command:
install.packages('txtplot')
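The idea behind a text plot is simple enough to sketch by hand: scale each frequency to a number of characters and print one row per x value. The Python below is only an illustration of that idea, not how txtplot is implemented; `text_bars` is a hypothetical helper, and the sample counts are the first five frequencies from the table in the question.

```python
def text_bars(counts, lo=1, hi=25, width=50):
    """Return one text line per value in lo..hi, with a bar scaled to `width`."""
    peak = max(counts.get(v, 0) for v in range(lo, hi + 1))
    lines = []
    for v in range(lo, hi + 1):
        n = counts.get(v, 0)
        # Bar length is proportional to the count; the tallest bar fills `width`.
        bar = "#" * (round(n * width / peak) if peak else 0)
        lines.append(f"{v:>3} | {bar} {n}")
    return lines


# First five frequencies from the question's table (value -> count).
counts = {1: 621266, 2: 496647, 3: 436229, 4: 394595, 5: 353249}
for line in text_bars(counts, lo=1, hi=5, width=40):
    print(line)
```

Because the output is plain characters, it displays in any terminal without a graphics device or a PDF viewer.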
I also found that Jupyter may be the best tool for this, and it can be set up by following this tutorial: Embed Graphs In Jupyter Notebooks in R

How to calculate 95th percentile in Excel 2010 [duplicate]

This question already has answers here:
Calculate Percentile in Excel 2010
(3 answers)
Closed 9 years ago.
I am trying to calculate how many calls came back in 95 percentile of time. Below is my Result Set. I am working with Excel 2010
Milliseconds Number
0 1702
1 15036
2 14262
3 13190
4 9137
5 5635
6 3742
7 2628
8 1899
9 1298
10 963
11 727
12 503
13 415
14 311
15 235
16 204
17 140
18 109
19 83
20 72
21 55
22 52
23 35
24 33
25 25
26 15
27 18
28 14
29 15
30 13
31 19
32 23
33 19
34 21
35 20
36 25
37 26
38 13
39 12
40 10
41 17
42 6
43 7
44 8
45 4
46 7
47 9
48 11
49 12
50 9
51 9
52 9
53 8
54 10
55 10
56 11
57 3
58 7
59 7
60 2
61 5
62 7
63 5
64 5
65 2
66 3
67 2
68 1
70 1
71 2
72 1
73 4
74 1
75 1
76 1
77 3
80 1
81 1
85 1
87 2
93 1
96 1
100 1
107 1
112 1
116 1
125 1
190 1
356 1
450 1
492 1
497 1
554 1
957 1
Some background on what the above information means:
1702 calls came back in 0 milliseconds
15036 calls came back in 1 milliseconds
14262 calls came back in 2 milliseconds
etc etc
So, to calculate the 95th percentile from the above data, I am using this formula in Excel 2010:
=PERCENTILE.EXC(IF(TRANSPOSE(ROW(INDIRECT("1:"&MAX(H$2:H$96))))<=H$2:H$96,A$2:A$96),0.95)
Can anyone tell me whether the way I am doing this in Excel 2010 is right?
I am getting 10 as the 95th percentile using the above formula.
Thanks for the help.
That's essentially the same question you asked here, and the same formula I suggested. As per my last comments on that question, the formula should work OK as long as you use CTRL+SHIFT+ENTER correctly. I get 10 as the answer for this example using that formula.
I think you can verify manually that this is indeed the correct answer. If you keep a running total in an adjacent column, you can see where the 95th percentile is reached.
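The running-total check can also be done in a few lines of Python. This sketch uses a simple definition, the smallest latency whose cumulative count reaches 95% of all calls, which is an assumption and not the interpolation rule that PERCENTILE.EXC applies. The counts are copied from the question's table; `weighted_percentile`, `DENSE` and `TAIL` are hypothetical names.

```python
# Call counts for 0..68 ms, copied from the table in the question.
DENSE = [1702, 15036, 14262, 13190, 9137, 5635, 3742, 2628, 1899, 1298, 963,
         727, 503, 415, 311, 235, 204, 140, 109, 83, 72,
         55, 52, 35, 33, 25, 15, 18, 14, 15, 13,
         19, 23, 19, 21, 20, 25, 26, 13, 12, 10,
         17, 6, 7, 8, 4, 7, 9, 11, 12, 9,
         9, 9, 8, 10, 10, 11, 3, 7, 7, 2,
         5, 7, 5, 5, 2, 3, 2, 1]

# Sparse tail of the table: latency in ms -> number of calls.
TAIL = {70: 1, 71: 2, 72: 1, 73: 4, 74: 1, 75: 1, 76: 1, 77: 3, 80: 1, 81: 1,
        85: 1, 87: 2, 93: 1, 96: 1, 100: 1, 107: 1, 112: 1, 116: 1, 125: 1,
        190: 1, 356: 1, 450: 1, 492: 1, 497: 1, 554: 1, 957: 1}


def weighted_percentile(freq, q):
    """Smallest value whose running count reaches q times the total count."""
    target = q * sum(freq.values())
    running = 0
    for value, count in sorted(freq.items()):
        running += count
        if running >= target:
            return value
    raise ValueError("empty frequency table")


freq = dict(enumerate(DENSE))
freq.update(TAIL)
p95 = weighted_percentile(freq, 0.95)
print(p95)
```

With these counts the running total passes the 95% mark at 10 milliseconds, which agrees with the result from the array formula.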
