How to subset panel data for different periods for each country

I have a panel dataset with values for 15 variables for 50 countries over the period 1945-2021. The unit of analysis is country-year. Here's a simplified version of the dataset just to show what it looks like.
set.seed(42)
n <- 20
Data <- data.frame(Country=rep(LETTERS[1:5], n/5),date=sample(1945:2021, n, replace=TRUE), variable1=sample(18:30, n, replace=TRUE), variable2=sample(10:100, n, replace=TRUE),variable3=rnorm(n))
Data
Is there a way to set up a function to subset the data such that for country A, I subset for the years 1914-2004, 2005-2018, and 2019-2021, but with the specific periods differing by country? Country B, for example, would be 1930-1963, 1964-1965, 1966-1983; Country C would be 1900-1991, 1992-2018, 2019-2021, etc.
Data2 <- subset(Data, Country == "A" & ((date >= 1914 & date <= 2004) | (date >= 2005 & date <= 2018) | (date >= 2019 & date <= 2021)))
I'm wondering if there is a way to do it more quickly and efficiently than the example above.
Thank you!
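One possible approach, sketched here in Python with pandas rather than R, is to keep the periods in a small lookup table with one row per country and period, join it to the data on country, and keep only the rows whose year falls inside a period. The `periods` table, its `start`/`end` column names, and the toy rows in `data` are assumptions made for illustration; the period bounds are the ones quoted in the question.

```python
import pandas as pd

# Toy country-year rows (hypothetical values, same column names as the sample data).
data = pd.DataFrame({
    "Country": ["A", "A", "A", "B", "B", "C"],
    "date": [1950, 2010, 2020, 1960, 1990, 2019],
    "variable1": [18, 22, 30, 25, 19, 27],
})

# One row per (country, period); bounds are inclusive.
periods = pd.DataFrame(
    [
        ("A", 1914, 2004), ("A", 2005, 2018), ("A", 2019, 2021),
        ("B", 1930, 1963), ("B", 1964, 1965), ("B", 1966, 1983),
        ("C", 1900, 1991), ("C", 1992, 2018), ("C", 2019, 2021),
    ],
    columns=["Country", "start", "end"],
)

# Pair every observation with every period of its own country,
# then keep only the pairs where the year lies inside the period.
merged = data.merge(periods, on="Country")
in_period = merged[(merged["date"] >= merged["start"]) & (merged["date"] <= merged["end"])]

# One subset per (country, start, end) combination that actually has rows.
subsets = {
    key: group.drop(columns=["start", "end"])
    for key, group in in_period.groupby(["Country", "start", "end"])
}

for key, group in subsets.items():
    print(key)
    print(group)
```

Rows that fall outside every listed period (country B in 1990 here) simply drop out, and adding a country only means adding rows to the lookup table. The same join-then-filter idea should carry over to R with `merge()` and `subset()`.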

How to calculate the annual rate in python 3

Here's the code:
P = int(input("Enter starting principle please.\n"))
n = int(input("Enter Compound interest rate.(daily, monthly, quarterly,half-year, yearly)\n"))
r = float(input("Enter annual interest amount. (decimal)\n"))
t = int(input("Enter the amount of years.\n"))
t = 1
while t-1 <= 5-1 :
    final = P * (((1 + (r/n)) ** (n*t)))
    t += 1
    print ("The final amount after", round(t-1), "years is", round(final,2))
When I tried to input:
1000
1
0.02
2
it will result like this:
Enter starting principle please.
Enter Compound interest rate.(daily, monthly, quarterly, half-year, yearly)
Enter annual interest amount. (decimal)
Enter the amount of years.
The final amount after 1 years is 1020.0
The final amount after 2 years is 1040.4
The final amount after 3 years is 1061.21
The final amount after 4 years is 1082.43
The final amount after 5 years is 1104.08
The problem is that it ignores the number of years I enter (e.g., when I enter 2 years, it still prints results up to 5 years).
Why are you setting t from input, but then immediately ignoring the input and overwriting its value with t = 1?
And where is this 5 coming from: while t-1 <= 5-1
Essentially you need a new variable for what you're doing in the loop, separate from t. And "magic numbers" like 5 appearing in code for no reason are something to be avoided.
P = int(input("Enter starting principle please.\n"))
n = int(input("Enter Compound interest rate.(daily, monthly, quarterly, half-year, yearly)\n"))
r = float(input("Enter annual interest amount. (decimal)\n"))
t = int(input("Enter the amount of years.\n"))
# Use a separate loop variable so the number of years read into t is not overwritten.
for year in range(1, t + 1):
    final = P * (1 + r / n) ** (n * year)
    print("The final amount after", year, "years is", round(final, 2))
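If it helps to check the arithmetic without typing at the prompts, the same calculation can be wrapped in a function. This is only a sketch: the name `compound_growth` and its parameter names are made up for illustration, and the inputs are the ones from the question (1000, 1, 0.02, 2).

```python
def compound_growth(principal, rate, periods_per_year, years):
    """Return the rounded balance at the end of each year from 1 to `years`."""
    return [
        round(principal * (1 + rate / periods_per_year) ** (periods_per_year * year), 2)
        for year in range(1, years + 1)
    ]


# Same inputs as in the question: P = 1000, n = 1, r = 0.02, t = 2.
balances = compound_growth(1000, 0.02, 1, 2)
for year, balance in enumerate(balances, start=1):
    print("The final amount after", year, "years is", balance)
```

With 2 years requested, this prints only the first two lines of the output shown in the question (1020.0 and 1040.4).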

DAX Calculate Year To Date - Year to Previous Month (year change)

I'm trying to figure out how to calculate the equivalent of YTD, but only up to the previous month, including when the year changes.
For example:
Date | Value
2018-10 | 100
2018-11 | 100
2018-12 | 100
2019-01 | 100
2019-02 | 100
2019-03 | 100
Results expected:
When 2019-03 is selected
YTD = 300 (accumulated from 2019-01- to 2019-03)
Previous month accumulated = 200 (accumulated from 2019-01 to 2019-02)
When 2019-01 is selected
YTD = 100
Previous month accumulated = 300 (accumulated from 2018-10 to 2018-12)
I've tried:
Value_Accum_PreviousMonth:=TOTALYTD([Value];Calendar[Date]) - [Value]
but with the change of year, it doesn't work
Any ideas?
Disclaimer - I created this solution using Power BI. The formula should transfer over to Excel/PowerPivot, but it may require some minor tweaks.
While the time intelligence functions of DAX are super useful, I seem to be drawn towards the manual approach of calculating YTD, prior year, etc. With that in mind, this is how I would solve your problem.
First, I would make a measure that is a simple sum of your Value column. Having this measure is just nice down the road; not an absolute necessity.
Total Value = SUM(Data[Value])
Next, for us to calculate YTD manually based on the prior month, we need to know two things: 1) what the target month is (i.e. the prior month) and 2) what the year of that month is.
If you are going to use those values anywhere else, I would put them into their own measures and use them here. If this is the only place they will be used, I like to use variables to calculate these kinds of values.
The first value we need is the selected date (or the max date when nothing is selected).
VAR SelectedDate = MAX(Data[Date])
With that date, we can calculate the TargetYear and TargetMonth. Both of them are simple IF statements to catch the January/December crossover.
VAR TargetYear = IF(MONTH(SelectedDate) = 1, YEAR(SelectedDate) - 1, YEAR(SelectedDate))
VAR TargetMonth = IF(MONTH(SelectedDate) = 1, 12, MONTH(SelectedDate) - 1)
With all of the values we need, we write a CALCULATE statement that filters the data to rows where the year equals TargetYear and the month is less than or equal to TargetMonth.
CALCULATE([Total Value], FILTER(ALL(Data), YEAR(Data[Date]) = TargetYear && MONTH(Data[Date]) <= TargetMonth))
Putting it all together looks like this.
Prior Month YTD =
VAR SelectedDate = MAX(Data[Date])
VAR TargetYear = IF(MONTH(SelectedDate) = 1, YEAR(SelectedDate) - 1, YEAR(SelectedDate))
VAR TargetMonth = IF(MONTH(SelectedDate) = 1, 12, MONTH(SelectedDate) - 1)
RETURN
CALCULATE([Total Value], FILTER(ALL(Data), YEAR(Data[Date]) = TargetYear && MONTH(Data[Date]) <= TargetMonth))
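To sanity-check the January/December rollover outside of DAX, the same logic can be traced in a few lines of Python. This is just a sketch of the measure's arithmetic, not a replacement for it: the function name `prior_month_ytd` is made up, and each month from the question's table is represented by its first day.

```python
from datetime import date

# The sample rows from the question, one value per month.
rows = [
    (date(2018, 10, 1), 100),
    (date(2018, 11, 1), 100),
    (date(2018, 12, 1), 100),
    (date(2019, 1, 1), 100),
    (date(2019, 2, 1), 100),
    (date(2019, 3, 1), 100),
]


def prior_month_ytd(rows, selected):
    """Sum values from January of the target year through the month before `selected`."""
    # Step back one month; January rolls over to December of the previous year.
    if selected.month == 1:
        target_year, target_month = selected.year - 1, 12
    else:
        target_year, target_month = selected.year, selected.month - 1
    return sum(
        value
        for day, value in rows
        if day.year == target_year and day.month <= target_month
    )


print(prior_month_ytd(rows, date(2019, 3, 1)))  # 200
print(prior_month_ytd(rows, date(2019, 1, 1)))  # 300
```

Both results match the expected values listed in the question: 200 when 2019-03 is selected and 300 when 2019-01 is selected.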

How can I count categorical columns by month in Pandas?

I have time series data with a column which can take a value A, B, or C.
An example of my data looks like this:
date,category
2017-01-01,A
2017-01-15,B
2017-01-20,A
2017-02-02,C
2017-02-03,A
2017-02-05,C
2017-02-08,C
I want to group my data by month and store both the sum of the count of A and the count of B in column a_or_b_count and the count of C in c_count.
I've tried several things, but the closest I've been able to do is to preprocess the data with the following function:
def preprocess(df):
    # Remove everything more granular than day by splitting the stringified version of the date.
    df['date'] = pd.to_datetime(df['date'].apply(lambda t: t.replace('\ufeff', '')), format="%Y-%m-%d")
    # Set the time column as the index and drop redundant time column now that time is indexed. Do this op in-place.
    df = df.set_index(df.date)
    df.drop('date', inplace=True, axis=1)
    # Group all events by (year, month) and count category by values.
    counted_events = df.groupby([(df.index.year), (df.index.month)], as_index=True).category.value_counts()
    counted_events.index.names = ["year", "month", "category"]
    return counted_events
which gives me the following:
year  month  category
2017  1      A           2
             B           1
      2      C           3
             A           1
The process to sum up all A's and B's would be quite manual since category becomes a part of the index in this case.
I'm an absolute pandas menace, so I'm likely making this much harder than it actually is. Can anyone give tips for how to achieve this grouping in pandas?
I tried this, so I'm posting it, though I like @Scott Boston's solution better. Here I combine the A and B values up front, before grouping.
df.date = pd.to_datetime(df.date, format = '%Y-%m-%d')
df.loc[(df.category == 'A')|(df.category == 'B'), 'category'] = 'AB'
new_df = df.groupby([df.date.dt.year,df.date.dt.month]).category.value_counts().unstack().fillna(0)
new_df.columns = ['a_or_b_count', 'c_count']
new_df.index.names = ['Year', 'Month']
            a_or_b_count  c_count
Year Month
2017 1               3.0      0.0
     2               1.0      3.0
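A variation on the same idea, as a sketch: map each category to the name of the column it should be counted in, then let `pd.crosstab` do the counting. This leaves the original `category` column untouched, avoids renaming columns by position, and gives integer counts instead of floats. The name `bucket` and the inline CSV are only there to make the example self-contained.

```python
import io

import pandas as pd

csv_text = """date,category
2017-01-01,A
2017-01-15,B
2017-01-20,A
2017-02-02,C
2017-02-03,A
2017-02-05,C
2017-02-08,C
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Map each category to the output column it should be counted under.
bucket = df["category"].map(
    {"A": "a_or_b_count", "B": "a_or_b_count", "C": "c_count"}
)

# Rows are (Year, Month); columns are the bucket names; cells are counts.
counts = pd.crosstab(
    [df["date"].dt.year.rename("Year"), df["date"].dt.month.rename("Month")],
    bucket,
)
print(counts)
```

Months with no rows for a bucket get a count of 0, so January 2017 shows 3 and 0, and February 2017 shows 1 and 3.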

Using BigQuery to find outliers with standard deviation results combined with WHERE clause

Standard deviation analysis can be a useful way to find outliers. Is there a way to incorporate the result of this query (finding the value of the fourth standard deviation away from the mean)...
SELECT (AVG(weight_pounds) + STDDEV(weight_pounds) * 4) as high FROM [publicdata:samples.natality];
result = 12.721342001626912
...into another query that produces information about which states and dates have the most babies born heavier than 4 standard deviations above the average?
SELECT state, year, month ,COUNT(*) AS outlier_count
FROM [publicdata:samples.natality]
WHERE
(weight_pounds > 12.721342001626912)
AND
(state != '' AND state IS NOT NULL)
GROUP BY state, year, month
ORDER BY outlier_count DESC;
Result:
Row state year month outlier_count
1 MD 1990 12 22
2 NY 1989 10 17
3 CA 1991 9 14
Essentially it would be great to combine this into a single query.
You can abuse JOIN for this (and thus performance will suffer):
SELECT n.state, n.year, n.month, COUNT(*) AS outlier_count
FROM (
  SELECT state, year, month, weight_pounds, 1 as key
  FROM [publicdata:samples.natality]) as n
JOIN (
  SELECT (AVG(weight_pounds) + STDDEV(weight_pounds) * 4) as giant_baby,
         1 as key
  FROM [publicdata:samples.natality]) as o
ON n.key = o.key
WHERE
  (n.weight_pounds > o.giant_baby)
  AND
  (n.state != '' AND n.state IS NOT NULL)
GROUP BY n.state, n.year, n.month
ORDER BY outlier_count DESC;
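To see what the combined query computes, here is the same two-step logic as a small Python sketch: first derive the threshold from all rows, then count the rows above it per (state, year, month). The rows are made-up toy data, not values from the natality table, the function name `outlier_counts` is hypothetical, and the sketch assumes `STDDEV` means the sample standard deviation.

```python
from collections import Counter
from statistics import mean, stdev

# Toy rows of (state, year, month, weight_pounds); values are invented for illustration.
rows = (
    [("NY", 1989, 10, 7.0)] * 20
    + [("CA", 1991, 9, 8.0)] * 20
    + [("MD", 1990, 12, 20.0)]
)


def outlier_counts(rows, k=4):
    """Count rows heavier than mean + k standard deviations, per (state, year, month)."""
    weights = [weight for *_, weight in rows]
    # Step 1: the threshold comes from the whole table (the inner subquery).
    threshold = mean(weights) + k * stdev(weights)
    # Step 2: filter on the threshold and group (the outer query).
    counts = Counter(
        (state, year, month)
        for state, year, month, weight in rows
        if state and weight > threshold
    )
    # Highest counts first, like ORDER BY outlier_count DESC.
    return counts.most_common()


print(outlier_counts(rows))
```

The constant `key` column in the SQL above exists only to attach the single threshold row to every data row; in the sketch that role is played by the `threshold` variable.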

Formula to calculate salary after x year?

Suppose I am paid $10,000 a year and my salary increases by 5% each year.
What Excel formula tells me how much I will be earning in 5 years?
Thank you for any advice.
The formula in Excel is:
=VF(5%;5;0;-10000)
Which results in: $12,762.82
If your Office is the English version, you can use:
=FV(5%;5;0;-10000)
=(10000*((1 + 0.05)^5))
The Compound Interest Equation
P = C (1 + r/n)^(n*t)
Where...
P = Future Value
C = Initial Deposit/Salary
r = Interest Rate/Pay Increase (expressed as a fraction, e.g. 0.05)
n = Number of Times per Year the Interest/Pay Raise Is Compounded (1 in your example, since the raise is applied once a year)
t = Number of Years to Calculate
The formula is =POWER(B1,C1)*10000, where cell B1 holds (1+5%) and cell C1 holds the number of years.
current = 10 000
for each year -1
current = current * 1.05
end for
current now has current salary for the given number of years
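As a quick cross-check, the closed-form formula and the year-by-year loop can be compared in Python. This is only a sketch: the function name `salary_after` is made up, and it assumes the raise is applied once per year for 5 years, which is what the FV result above corresponds to.

```python
def salary_after(start, raise_rate, years):
    """Closed form: the salary after `years` annual raises of `raise_rate`."""
    return start * (1 + raise_rate) ** years


closed_form = salary_after(10_000, 0.05, 5)

# Same result, built up one raise at a time.
looped = 10_000
for _ in range(5):
    looped *= 1.05

print(round(closed_form, 2))  # 12762.82
print(round(looped, 2))       # 12762.82
```

Both agree with the $12,762.82 that the FV formula returns.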
