Anomaly Detection in option Theta values - python-3.x

I am currently working on a financial data problem. I want to detect trades for which the models generate anomalous theta values (this can happen for several reasons).
My data mainly consists of trades with their profile variables (dealId, portfolio, etc.), along with the theta values and theta components for different dates (going back 3 years).
Data that I am currently using looks like this:
Tradeid | Date1 | Date2 | ...
id1 | 1234 | 1238 | ...
id2 | 1289 | 1234 | ...
Currently, I am tracking the daily theta movement for all trades and sending out the trades whose theta has moved by more than 20k (absolute value).
I want to build an ML model that tracks theta movement and detects which deal id(s) have anomalous theta on the current date.
So far, I have tried clustering trades based on their theta movement correlation using DBSCAN with a distance matrix. I have also tried Isolation Forest, but it does not generalize well on this dataset.
All the anomaly detection examples that I have seen so far are more like finding a rotten apple in a bunch of apples. Is there an algorithm that is best suited to my case, or one that can be modified to suit my problem?

Your problem seems too simple to need machine learning.
You can manually define a threshold beyond which a value counts as anomalous and flag the trades that cross it.
To do that, you can analyze your data with pandas to find the mean, max, min, etc., and then choose a threshold.
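A minimal sketch of that approach, assuming the data is laid out as in the question (one row per trade, one column per date). The theta numbers, the extra trade id3, and the 20,000 cut-off taken from the question's current rule are illustrative assumptions, not real data.

```python
import pandas as pd

# Hypothetical theta values: one row per trade, one column per date.
theta = pd.DataFrame(
    {"Date1": [1234.0, 1289.0, 50000.0], "Date2": [1238.0, 1234.0, 75000.0]},
    index=pd.Index(["id1", "id2", "id3"], name="Tradeid"),
)

# Change in theta versus the previous date column.
daily_move = theta.diff(axis=1)

# Move into the most recent date, one value per trade.
latest_move = daily_move.iloc[:, -1]

# Summary statistics (mean, min, max, quartiles) to help pick a threshold.
print(latest_move.abs().describe())

# Assumed threshold: the 20k absolute move mentioned in the question.
THRESHOLD = 20_000
flagged = latest_move[latest_move.abs() > THRESHOLD]
print(flagged)
```

In practice the threshold would probably be chosen from the printed statistics (for example, a high quantile) rather than fixed in advance.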

Related

Is there a metric that can determine spatial and temporal proximity together?

Given a dataset that consists of geographic coordinates and a corresponding timestamp for each record, I want to know if there is a suitable measure that can determine the closeness of two points by taking both the spatial and the temporal distance into consideration.
The approaches I've tried so far include implementing a distance measure between the two coordinate values and calculating the time difference separately. But in that case, I'd need two threshold values, one for the spatial and one for the temporal distance, to determine their overall proximity.
I wanted to know whether there is a single function that can take these values as input together and give a single measure of their closeness. Ultimately, I want to use this measure to cluster similar records together.

How to use excel data to find period

I have three Excel columns of data from an experiment with a pendulum: time, angle displacement, and angular velocity. I was wondering if there is a way in Excel to calculate and then graph the period (and, if possible, display the function for the graph)... I realize it's kinda a dumb question. I'm still new at Excel.
Thanks for any pointers you can give!
If the Analysis ToolPak is installed, you can use Tools->Data Analysis->Fourier Analysis. If the data is a superposition of harmonic functions (sin, cos), the corresponding frequencies (the inverses of the periods) will appear as peaks in the Fourier analysis.
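The same idea can be sketched outside Excel. The example below assumes NumPy is available and uses a synthetic pendulum signal (a 2-second period sampled every 0.01 s, both made-up values) in place of the measured angle column; it finds the strongest frequency peak and inverts it to get the period.

```python
import numpy as np

# Synthetic stand-in for the measured data (assumed values, not real measurements).
dt = 0.01                      # sampling interval in seconds
t = np.arange(2000) * dt       # 20 seconds of samples
true_period = 2.0              # seconds
angle = 0.2 * np.cos(2 * np.pi * t / true_period)

# Magnitude spectrum of the mean-removed signal and the matching frequencies.
spectrum = np.abs(np.fft.rfft(angle - angle.mean()))
freqs = np.fft.rfftfreq(len(angle), d=dt)

# Skip the zero-frequency bin, then take the strongest peak.
peak = np.argmax(spectrum[1:]) + 1
period = 1.0 / freqs[peak]
print(f"estimated period: {period:.3f} s")
```

With real data the peak only resolves the period to within the frequency spacing 1 / (number of samples * dt), so a longer recording gives a sharper estimate.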

Correlation statistics

Naive question:
In the attached snapshot, I am trying to understand how correlation behaves when it is applied to actual values versus a new stream of data calculated from those values.
In the example,
columns A, B, C, D, E have very different correlations, but when I do a rolling sum on the same columns to get G, H, I, J, K, the correlations are all very much the same (strongly negative or strongly positive).
Are these two different types of correlation, or am I missing something?
Thanks in advance!!
Yes, these are different correlations. It is as if you measured the acceleration over time of 5 automobiles (your first piece of data) and correlated those accelerations. Each car accelerates at a different rate over time, leaving your correlations all over the place.
Your second set of data would be the velocity of each car at each point in time. Because each car is accelerating at a fairly constant rate (and doing so in one of two directions from the starting point), you get either a big positive or a big negative correlation.
You won't necessarily get that big positive or big negative correlation in the second set, but since the data in each of your lists is consistently positive or negative and grows at a consistent rate, each list correlates strongly, positively or negatively, with the others.
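A small sketch of this effect, assuming NumPy is available. The two series are made up: independent noise around a steady positive and a steady negative value, standing in for the "accelerations". The rolling sum is modeled here as a running total, which is what the velocity analogy describes.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical "acceleration" series: independent noise around +1 and -1.
up = 1.0 + rng.normal(0.0, 0.5, n)
down = -1.0 + rng.normal(0.0, 0.5, n)

# Correlation of the raw values: the noise is independent, so this is near zero.
raw_corr = np.corrcoef(up, down)[0, 1]

# Correlation of the running totals ("velocities"): both are dominated by a
# steady trend in opposite directions, so this is strongly negative.
cum_corr = np.corrcoef(np.cumsum(up), np.cumsum(down))[0, 1]

print(f"raw: {raw_corr:.2f}, running totals: {cum_corr:.2f}")
```

The strong correlation between the running totals reflects the shared trend, not any relationship between the underlying values.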

Monte Carlo Simulation in Excel for Non-normal Distributions

I would like to simulate the performance of a baseball player. I know his expected performance for every future year and the standard deviations of those performances (based on regression analysis). At first, I was thinking of using the NORMINV(RAND(),REF,REF) function in Excel, but the underlying distribution of baseball players' performances is dramatically right skewed. Is there a way that I can perform this sort of analysis in Excel or some other free or low-cost software? The end goal here is for the simulation to use the right-skewed distribution. Thanks very much.
R has lots of tools for this sort of analysis, though you'd have to look through the docs to figure out how to use them. R is free and open source.
If you have a cumulative distribution table (one that is evenly spaced and sufficiently detailed), then you can easily generate random values from this distribution in Excel by looking up a uniform random number generated by RAND() in your distribution table and taking the corresponding "x-axis" value.
=OFFSET($A$1,MATCH(RAND(),$B$2:$B$102),0)
A1 is the cell just above the table of "x-axis" values.
B2:B102 is the cumulative distribution table.
This is a simplified example. Some small modifications may be needed to handle edge-cases and adjust for biases.
If you have enough empirical data you should be able to create the cumulative distribution table.
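For comparison, the same table lookup can be sketched in Python using only the standard library. The x values and cumulative probabilities below are a made-up right-skewed table, and sample_from_cdf is a hypothetical helper name; the lookup mirrors MATCH with its default "largest value less than or equal to" behaviour.

```python
import bisect
import random

# Hypothetical right-skewed distribution table (not real baseball data).
x_values = [0, 1, 2, 4, 8]                 # "x-axis" values
cum_probs = [0.0, 0.5, 0.8, 0.95, 1.0]     # cumulative distribution, sorted


def sample_from_cdf(x_values, cum_probs, rng):
    """Draw one value by looking up a uniform random number in the table."""
    u = rng.random()                                # uniform in [0, 1), like RAND()
    idx = bisect.bisect_right(cum_probs, u) - 1     # last entry <= u, like MATCH
    return x_values[idx]


rng = random.Random(0)
samples = [sample_from_cdf(x_values, cum_probs, rng) for _ in range(10_000)]
share_of_zero = samples.count(0) / len(samples)
print(f"share of draws equal to 0: {share_of_zero:.3f}")
```

As with the Excel formula, this simple lookup never returns the last x value, because the uniform draw is always below 1; that is one of the edge cases a fuller version would need to handle, for example by interpolating between table rows.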

From one histogram, create a new histogram from just a mean or median?

Suppose I have a list of values that I can histogram and calculate descriptive statistics on, such as mean, median, max, standard deviation, etc. Perhaps this histogram is bimodal or right skewed. Let's call this group of data "DataSet1".
Suppose I had just a mean or median of another set of data. Let's call that DataSet2. I do not have all the raw data for DataSet2, just the median or mean. There is a strong belief that DataSet1 and DataSet2 would show the same variability in values.
If I knew just a single value, either the mean or the median, can I apply the descriptive statistics from DataSet1 to create a new histogram that mirrors the bimodal or right-skewed behavior of DataSet1?
Thanks
Dan
Alternative intent:
I have 3 years of historical data, and the data definitely has a "day of week" trend to it. I am using a Python API to apply seasonal ARIMA to forecast the next 7 days from the 3 years of historical data. The predicted value is great, but it is only 1 value. I would like to use that predicted value as the "mean" and create a histogram from the variability that the values have shown historically for that day of the week.
So, today is Thursday. Let's say I predict tomorrow to have a value of 78.6.
I want to sample potential values for tomorrow based on a mean of 78.6 but with variability similar to that shown on all historical Fridays.
If I look at historical Fridays, perhaps they show left-skewed behavior.
So when I sample with a mean of 78.6, if I sampled 100 times, the sampled values, if plotted in a histogram, would also skew to the left.
Hope that helps..
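One possible reading of this request, sketched with the standard library only: take the deviations of historical Fridays from their own mean and resample them around the forecast, so the sampled values are centred on 78.6 but keep the historical shape. The Friday values and the helper name sample_around_forecast are made up for illustration; this is not a method confirmed anywhere in the thread.

```python
import random
import statistics


def sample_around_forecast(historical, forecast, n_samples, rng):
    """Resample historical deviations from the mean, recentred on the forecast."""
    hist_mean = statistics.fmean(historical)
    deviations = [value - hist_mean for value in historical]
    return [forecast + rng.choice(deviations) for _ in range(n_samples)]


# Hypothetical left-skewed history of Friday values (mean 78.0).
historical_fridays = [80.0, 82.0, 81.0, 79.0, 83.0, 60.0, 82.0, 81.0, 70.0, 82.0]

rng = random.Random(0)
samples = sample_around_forecast(historical_fridays, 78.6, 100, rng)

# In a left-skewed sample the median sits above the centre value.
print(f"median of samples: {statistics.median(samples):.1f}")
```

This assumes the spread around the mean does not depend on the level of the forecast; if variability scales with the level, resampling ratios instead of differences may be a better fit.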
