Getting end of period change with different index - python-3.x

I’ve got a Panel of DataFrames containing yahoo finance stocks.
I want to re-index the stocks by Date inside the Panel:
To be more specific:
In simple words, for e.g. the stock quote ‘AA’ I want the difference between the end of Q1 2013 and the beginning of Q1 2013.
The expected result from:
AA['Close']['March 28, 2013'] - AA['Open']['January 2, 2013']
is 8.52 (the March closing price) - 8.99 (the January opening price) = -0.47 (the difference).
I want to do this for every quarter I have in the daily data, from Q1 2010 until Q3 2013, to get the difference; that is, to change the index from daily to quarterly.
What is the best way to do it?
Thanks everybody.

You should post what you expect your solution to look like so that I know I'm answering the correct question. I don't think the way I've interpreted your question makes sense. Edit your question to clarify and I'll edit my answer accordingly.
When you're working with Time Series data, there are some special resampling methods that are handy in cases like this.
Conceptually they're similar to groupby operations.
First of all, fetch the data:
pan = DataReader(['AAPL', 'GM'], data_source='yahoo')
For demonstration, focus on just AAPL.
df = pan.xs('AAPL', axis='minor')
In [24]: df.head()
Out[24]:
Open High Low Close Volume Adj Close
Date
2010-01-04 213.43 214.50 212.38 214.01 17633200 209.51
2010-01-05 214.60 215.59 213.25 214.38 21496600 209.87
2010-01-06 214.38 215.23 210.75 210.97 19720000 206.53
2010-01-07 211.75 212.00 209.05 210.58 17040400 206.15
2010-01-08 210.30 212.00 209.06 211.98 15986100 207.52
Now use the resample method to get to the frequency you're looking for. I'm going to demonstrate quarterly, but you can substitute the appropriate code. We'll use BQS for Business Quarter Start. To aggregate, we take the sum.
In [33]: df.resample('BQS', how='sum').head()
Out[33]:
Open High Low Close Volume Adj Close
Date
2010-01-01 12866.86 12989.62 12720.40 12862.16 1360687400 12591.73
2010-04-01 16083.11 16255.98 15791.50 16048.55 1682179900 15711.11
2010-07-01 16630.16 16801.60 16437.74 16633.93 1325312300 16284.18
2010-10-01 19929.19 20069.74 19775.96 19935.66 1025567800 19516.49
2011-01-03 21413.54 21584.60 21219.88 21432.36 1122998000 20981.76
OK, so now we want the Open for the current period minus the Close for the previous period for the gross change, or (current / previous) - 1 for the percentage change. To do this, use the shift method, which shifts all the data down one row.
In [34]: df.resample('BQS', how='sum')['Open'] - df.resample('BQS', how='sum').shift()['Close']
Out[34]:
Date
2010-01-01 NaN
2010-04-01 3220.95
2010-07-01 581.61
2010-10-01 3295.26
2011-01-03 1477.88
2011-04-01 -119.37
2011-07-01 3058.69
2011-10-03 338.77
2012-01-02 6487.65
2012-04-02 5479.15
2012-07-02 3698.52
2012-10-01 -4367.70
2013-01-01 -7767.81
2013-04-01 -355.53
2013-07-01 -19491.64
Freq: BQS-JAN, dtype: float64
You could write a function and apply it to each DataFrame in the panel you get from Yahoo.
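In current pandas, Panel and resample(how=...) are gone, so a rough sketch of the asker's actual goal (quarter-end Close minus quarter-start Open) might look like the function below; the function name and the per-ticker dictionary are assumptions for illustration, not part of the original answer.

import pandas as pd

def quarterly_change(df):
    # df: daily OHLC data with a DatetimeIndex and 'Open'/'Close' columns,
    # like the AAPL frame shown above
    q = df.resample('QS').agg({'Open': 'first', 'Close': 'last'})  # one row per quarter
    q['Change'] = q['Close'] - q['Open']  # quarter-end Close minus quarter-start Open
    return q

# For the 'AA' example this reproduces Close(2013-03-28) - Open(2013-01-02) for Q1 2013.
# Applied per ticker (assuming a dict of DataFrames called frames):
# changes = {ticker: quarterly_change(frame) for ticker, frame in frames.items()}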

Related

Resample a distribution conditional on another value

I would like to create a series of simulated values by resampling from empirical observations. The data I have is a time series at 1-minute frequency. The simulations should be made for an arbitrary number of days, with the same times each day. The twist is that I need to sample conditional on the time, i.e. when sampling for a time of 8:00, it should be more probable to sample a value observed around 8:00 (but not limited to 8:00) from the original series.
I have made a small sketch to show how the draw distribution changes depending on which time a value is simulated for:
I.e. for T=0 it is most probable to draw a value from the actual distribution where the time of day is close to 0, and not probable to draw a value from the original distribution at a time of day of T=n/2 or later, where n is the number of unique timestamps in a day.
Here is a code snippet to generate sample data (I am aware that there is no need to sample conditionally on this test data, but it is just to show the structure of the data):
import numpy as np
import pandas as pd
# Create a test data frame (only for illustration)
df = pd.DataFrame(index=pd.date_range(start='2020-01-01', end='2020-12-31', freq='T'))
df['MyValue'] = np.random.normal(0, scale=1, size=len(df))
print(df)
MyValue
2020-01-01 00:00:00 0.635688
2020-01-01 00:01:00 0.246370
2020-01-01 00:02:00 1.424229
2020-01-01 00:03:00 0.173026
2020-01-01 00:04:00 -1.122581
...
2020-12-30 23:56:00 -0.331882
2020-12-30 23:57:00 -2.463465
2020-12-30 23:58:00 -0.039647
2020-12-30 23:59:00 0.906604
2020-12-31 00:00:00 -0.912604
[525601 rows x 1 columns]
# Objective: Create a new time series, where each time the values are
# drawn conditional on the time of the day
I have not been able to find an answer on here that fits my requirements. All help is appreciated.
I'll focus on this sentence:
need to sample conditional on the time, i.e. when sampling for a time of 8:00, it should be more probable to sample a value around 8:00 (but not limited to 8:00) from the original serie.
Then, assuming the standard deviation is one sixth of the day (given your drawing):
value = np.random.normal(loc=current_time_sample, scale=total_samples/6)
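To make that concrete, here is one possible sketch, reusing the df generated in the question; the helper name, the Gaussian kernel, and the 240-minute scale (one sixth of a day) are my assumptions. It weights every observation by how close its time of day is to the target time and then draws from the weighted empirical distribution.

import numpy as np
import pandas as pd

def sample_conditional_on_time(df, target_time, scale_minutes=1440 / 6):
    # minute of day for every observation and for the target time
    minutes = df.index.hour * 60 + df.index.minute
    target = target_time.hour * 60 + target_time.minute
    # circular distance within a day, so 23:59 counts as close to 00:00
    diff = np.abs(minutes - target)
    diff = np.minimum(diff, 1440 - diff)
    # Gaussian kernel: observations near the target time get higher weight
    weights = np.exp(-0.5 * (diff / scale_minutes) ** 2)
    probs = np.asarray(weights / weights.sum())
    return np.random.choice(df['MyValue'].to_numpy(), p=probs)

# e.g. one simulated value "for 08:00"
simulated = sample_conditional_on_time(df, pd.Timestamp('2020-01-01 08:00'))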

What is the best way to calculate the month difference between two dates which have a format like yyyyMM in PySpark?

In my pandas DataFrame df, I have the columns MTH and old_dt:
MTH old_dt
201901 2018-03-01
201902 2017-02-20
201903 2016-05-12
To calculate the month difference between the two columns, I use the following pandas code:
df['mth'] = pd.to_datetime(df['MTH'], format='%Y%m')
df = df.assign(
    dif=(df.mth.dt.year - df.old_dt.dt.year) * 12 +
        (df.mth.dt.month - df.old_dt.dt.month) + 1
)
The result is an integer, which is exactly what I want.
Now, since my dataset is huge (more than 1 billion records), I've decided to move to PySpark, but I'm not sure how this works there. I searched online and saw a function called month_difference, but it doesn't seem to be what I want.
Thanks for any help, and thanks Jens for the editing.
My expected output is :
MTH old_dt dif
201901 2018-03-01 11
201902 2017-02-20 25
201903 2016-05-12 35
Will this work, please? I was not able to open my AE to test it:
from pyspark.sql import functions as F

def mth_interval(df):
    df = df.withColumn("mth", F.to_date("MTH", "yyyyMM"))
    df = df.withColumn(
        "month_diff",
        (F.year("mth") - F.year("old_dt")) * 12
        + (F.month("mth") - F.month("old_dt")) + 1
    )
    return df
thanks!
just tested and worked!
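For reference, a minimal local sanity check of that function could look like the sketch below; the sample rows are the ones from the question, and it assumes F.to_date with its default format parses old_dt.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [("201901", "2018-03-01"), ("201902", "2017-02-20"), ("201903", "2016-05-12")],
    ["MTH", "old_dt"],
)
sdf = sdf.withColumn("old_dt", F.to_date("old_dt"))  # string -> date
mth_interval(sdf).show()
# month_diff should come out as 11, 25, 35, matching the expected output above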

Filtering discrepancies in duplicate measurements

I have a dataset with the following problem.
Sometimes the temperature sensor returns duplicate readings at the exact same minute, where one of the two duplicates is "reasonable" and the other is slightly off.
For example:
TEMP TIME
1 24.5 4/1/18 2:00
2 24.7 4/1/18 2:00
3 24.6 4/1/18 2:05
4 28.3 4/1/18 2:05
5 24.3 4/1/18 2:10
6 24.5 4/1/18 2:10
7 26.5 4/1/18 2:15
8 24.4 4/1/18 2:15
9 24.7 4/1/18 2:20
10 22.0 4/1/18 2:20
Lines 5, 7 & 10 are readings that should be removed, as they are too high or too low (it doesn't make sense for the temperature to rise and drop by more than a degree within 5 minutes in a relatively stable environment).
The end goal with this dataset is to average the similar values (such as lines 1 & 2) and remove the lines that are too extreme (such as lines 5 & 7) from the dataset entirely.
My current idea is to look at a previously obtained row and, if one of the 2 duplicates is more than +/- 0.5 degrees away from it, mark it with TRUE in a 3rd column so I can filter out all the TRUE values at the end. I'm not sure how to express "+ OR - 0.5 of a previous number" within the IF statement, however. Does anyone know?
Here is a google sheet example that does what you want:
https://docs.google.com/spreadsheets/d/1Va9RjSeulOfVTd-0b4EM4azbUkYUb22jXNc_EcafUO8/edit?usp=sharing
What I did:
Calculate a column with a 3-item running average of the data using "=AVERAGE(B1:B3)"
Filter the list using "=IF(ABS(B2-C2) < 1, B2, )"
Calculate the average of the filtered list
The use of absolute value is what provides the "+ OR -" you were looking for: if the distance between the two numbers is too large, the term is not included.
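The same filtering idea translates into a rough pandas sketch, in case it is easier to automate. This is only a variation: it uses a rolling median instead of the sheet's running mean, so an extreme reading does not drag its own reference value; the column names come from the question and the 1-degree threshold mirrors the formula above.

import pandas as pd

df = pd.DataFrame({
    'TEMP': [24.5, 24.7, 24.6, 28.3, 24.3, 24.5, 26.5, 24.4, 24.7, 22.0],
    'TIME': pd.to_datetime(['2018-04-01 02:00'] * 2 + ['2018-04-01 02:05'] * 2 +
                           ['2018-04-01 02:10'] * 2 + ['2018-04-01 02:15'] * 2 +
                           ['2018-04-01 02:20'] * 2),
})

# reference value per reading: a 3-reading rolling median
reference = df['TEMP'].rolling(3, center=True, min_periods=1).median()

# flag readings too far from the reference (absolute difference covers "+ OR -")
df['EXTREME'] = (df['TEMP'] - reference).abs() >= 1

# drop the flagged rows, then average the remaining duplicates per timestamp
cleaned = df.loc[~df['EXTREME']].groupby('TIME', as_index=False)['TEMP'].mean()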
A simple solution came to my mind. Follow the steps below:
Convert the data to a table
Add a 4th column at the end
Enter the formula "Current Value - Previous Value"
Filter the column for high difference values
Delete those filtered rows and you'll be left with the normal values
Here's the reference image:
Or, if you only want to compare readings with the same timestamp, do the following:
Convert your data to a table
Add a 4th column at the end of the table
Write the following formula in the 4th column:
IF(Current_Time = Previous_Time, Current_Temp - Previous_Temp, "")
Filter and delete the rows with a high difference
See the following image:

Apply a function on a pandas dataframe through a list of datetime windows

I'd like to calculate some aggregates over a pandas DataFrame. My DataFrame does not have a uniform sample time, but I want to calculate aggregates at uniform sample times.
For example: I want to calculate the mean of the last 7 days, for each day.
So I want the resulting DataFrame to have a daily datetime index.
I tried to implement this with a map-reduce structure, but it was 40 times slower than a pandas implementation (in some specific cases).
My question is, do you know if there is any way to do this using pandas built-in functions?
An input DataFrame could be:
a b c
2010-01-01 00:00:00 0.957828 0.784962 0.801670
2010-01-01 00:00:06 0.669214 0.484439 0.479857
2010-01-01 00:00:18 0.537689 0.222179 0.995624
2010-01-01 00:01:15 0.339822 0.787626 0.144389
2010-01-01 00:02:21 0.705167 0.163373 0.317012
... ... ... ...
2010-03-14 23:57:35 0.799490 0.692932 0.050606
2010-03-14 23:58:40 0.380406 0.825227 0.643480
2010-03-14 23:58:43 0.838390 0.701595 0.632378
2010-03-14 23:59:01 0.604610 0.965274 0.503141
2010-03-14 23:59:02 0.320855 0.937064 0.192669
And the function should output something like this:
a b c
2010-01-01 0.957828 0.784962 0.801670
2010-01-02 0.499331 0.499944 0.505271
2010-01-03 0.499731 0.498455 0.503352
2010-01-04 0.499632 0.499328 0.502895
2010-01-05 0.500119 0.500299 0.502169
... ... ... ...
2010-03-10 0.499813 0.499680 0.501154
2010-03-11 0.499965 0.500226 0.501582
2010-03-12 0.500644 0.500720 0.501243
2010-03-13 0.500439 0.500264 0.501203
2010-03-14 0.500776 0.500334 0.501048
Where each value corresponds to the mean of the last 7 days.
I know it's an old thread but maybe this answer would be helpful to someone.
I think it's:
df.rolling(window=your_window_value).mean() # Simple Moving Average
That could be used.
But it will only work if you have the same number of rows for each day. In that case you could use a window value equal to 7 times the number of rows per day.
If each day has a different number of rows, then we would probably need a different solution, but that scenario was not fully defined in the question...
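For the uneven-rows case, here is a hedged sketch of two possibilities; the difference between them is whether every raw sample or every day carries equal weight in the 7-day mean.

import numpy as np
import pandas as pd

# toy frame with a non-uniform sample time, shaped like the one in the question
idx = pd.to_datetime(['2010-01-01 00:00:00', '2010-01-01 00:00:06',
                      '2010-01-02 10:00:00', '2010-01-03 23:59:00'])
df = pd.DataFrame(np.random.rand(len(idx), 3), index=idx, columns=list('abc'))

# Option 1: rolling 7-calendar-day mean over the raw samples,
# then keep the last value of each day (every raw sample weighted equally)
daily_raw = df.rolling('7D').mean().resample('D').last()

# Option 2: collapse to daily means first, then a 7-row rolling mean
# (every day weighted equally instead)
daily = df.resample('D').mean().rolling(7, min_periods=1).mean()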

How to count issues with open status in Spotfire

I need to calculate the count of issue IDs with open status for each month.
I have the 3 columns below:
Issue_ID
Issue_Open_Date
Issue_Closed_Date
Issue_ID Issue_Open_Date Issue_Closed_Date Open_Issue_Count(required output)
IS_10 11/11/2014 1/5/2015 3
IS_11 11/12/2014 12/14/2014
IS_12 11/13/2014 11/15/2014
IS_13 11/14/2014 3/5/2015
IS_1 12/1/2014 12/15/2014 4
IS_2 12/2/2014 2/10/2015
IS_3 12/3/2014 1/15/2015
IS_4 1/1/2015 2/10/2015 4
IS_5 1/2/2015 3/11/2015
IS_6 1/3/2015 1/22/2015
IS_7 2/1/2015 3/5/2015 3
IS_8 2/2/2015 2/2/2015
IS_9 2/7/2015 2/28/2015
IS_14 3/1/2015 4/5/2015 1
Based on the above table, I need a count of open issues for each month.
For example, to get the count for December, it should check both the Dec and Nov records.
If an issue is closed within the same month, it means it is not in an open state.
Basically, for each month it should check that month's records and the previous months' records as well.
The required output is below:
Nov- 3
Dec- 4
Jan-4
Feb-3
march-1
So... I have a way, but it's ugly. I'm sure there's a better way, but I spent a while banging my head on this trying to make it work just within Spotfire without resorting to a Python script looping through rows and making comparisons.
With nested aggregated CASE statements in a cross table I made it work. It's a pain because it's fairly manual (you have to add each month), but it will look for issues that have a close date after the given month and an open date in that month or earlier.
<
Sum(Case
when ([Issue_Closed_Date]>Date(2014,11,30)) AND ([Issue_Open_Date]<Date(2014,12,1)) then 1 else 0 end) as [NOV14_OPEN] NEST
Sum(Case
when ([Issue_Closed_Date]>Date(2014,12,31)) AND ([Issue_Open_Date]<Date(2015,1,1)) then 1 else 0 end) as [DEC14_OPEN] NEST
Sum(Case
when ([Issue_Closed_Date]>Date(2015,1,31)) AND ([Issue_Open_Date]<Date(2015,2,1)) then 1 else 0 end) as [JAN15_OPEN] NEST
Sum(Case
when ([Issue_Closed_Date]>Date(2015,2,28)) AND ([Issue_Open_Date]<Date(2015,3,1)) then 1 else 0 end) as [FEB15_OPEN] NEST
Sum(Case
when ([Issue_Closed_Date]>Date(2015,3,31)) AND ([Issue_Open_Date]<Date(2015,4,1)) then 1 else 0 end) as [MAR15_OPEN]>
Screenshot:
As far as doing it with Python, you could probably loop through the data, do the comparisons, and save the result as a data table. If I'm feeling ambitious this weekend I might give it a try out of personal curiosity. I'll post here if so.
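For that Python route, a vectorized pandas sketch could look like the following (the dates are assumed to be month/day/year, as in the table above); it reproduces the expected monthly counts without looping.

import pandas as pd

issues = pd.DataFrame({
    "Issue_ID": ["IS_10", "IS_11", "IS_12", "IS_13", "IS_1", "IS_2", "IS_3",
                 "IS_4", "IS_5", "IS_6", "IS_7", "IS_8", "IS_9", "IS_14"],
    "Issue_Open_Date": ["11/11/2014", "11/12/2014", "11/13/2014", "11/14/2014",
                        "12/1/2014", "12/2/2014", "12/3/2014", "1/1/2015", "1/2/2015",
                        "1/3/2015", "2/1/2015", "2/2/2015", "2/7/2015", "3/1/2015"],
    "Issue_Closed_Date": ["1/5/2015", "12/14/2014", "11/15/2014", "3/5/2015",
                          "12/15/2014", "2/10/2015", "1/15/2015", "2/10/2015",
                          "3/11/2015", "1/22/2015", "3/5/2015", "2/2/2015",
                          "2/28/2015", "4/5/2015"],
})
issues["Issue_Open_Date"] = pd.to_datetime(issues["Issue_Open_Date"], format="%m/%d/%Y")
issues["Issue_Closed_Date"] = pd.to_datetime(issues["Issue_Closed_Date"], format="%m/%d/%Y")

# an issue counts as "open" for a month if it was opened by the month's end
# and closed only after the month's end (same rule as the case statements above)
for month in pd.period_range("2014-11", "2015-03", freq="M"):
    still_open = ((issues["Issue_Open_Date"] <= month.end_time)
                  & (issues["Issue_Closed_Date"] > month.end_time)).sum()
    print(month, still_open)   # 2014-11 3, 2014-12 4, 2015-01 4, 2015-02 3, 2015-03 1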
I think what makes this difficult is that it's not very logical to add a column showing number of issues open at a point in time because the data doesn't show time; it's "one row per unique issue."
I don't know what your end result should be, but you might be better off unpivoting the table.
unpivot the above data with the following settings:
pass through: [Issue_ID]
transform: [Issue_Open_Date], [Issue_Closed_Date]
optionally rename Category as "Action" and Value as "Action Date"
now that each row represents one action, create a calculated column assigning a numeric value to the action with the following formula.
CASE [Action]
WHEN "Issue_Open_Date" THEN 1
WHEN "Issue_Closed_Date" THEN -1
END
create a bar chart with [Action Date] along the X axis (I wouldn't drill further than month or week) and the following on the Y axis:
Sum([Action Numeric]) over (AllPrevious([Axis.X]))
you'll wind up with something like this:
you can then do all sorts of fancy things with this data, such as show a line chart with the rate at which cases open and close (you can even plot this on a combination chart with the pictured example).
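The unpivot-and-running-sum idea also has a direct pandas translation, reusing the issues frame from the sketch above; melt plays the role of Spotfire's unpivot, and the cumulative monthly sum reproduces the same counts.

actions = issues.melt(id_vars="Issue_ID",
                      value_vars=["Issue_Open_Date", "Issue_Closed_Date"],
                      var_name="Action", value_name="Action Date")
actions["Action Numeric"] = actions["Action"].map(
    {"Issue_Open_Date": 1, "Issue_Closed_Date": -1})
open_over_time = (actions.set_index("Action Date")["Action Numeric"]
                         .resample("M").sum().cumsum())
print(open_over_time)   # Nov 3, Dec 4, Jan 4, Feb 3, Mar 1 ... then 0 once IS_14 closes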
